Converting data types
Throughout this chapter, you’ll be working with San Francisco bike share ride data called bike_share_rides. It contains information on start and end stations of each trip, the trip duration, and some user information.
Before beginning to analyze any dataset, it’s important to take a look at the different types of columns you’ll be working with, which you can do using glimpse().
In this exercise, you’ll take a look at the data types contained in bike_share_rides and see how an incorrect data type can flaw your analysis.
dplyr and assertive are loaded and bike_share_rides is available.
bike_share_rides <- readRDS("_data/bike_share_rides_ch1_1.rds")
# Glimpse at bike_share_rides
glimpse(bike_share_rides)
## Rows: 35,229
## Columns: 10
## $ ride_id <int> 52797, 54540, 87695, 45619, 70832, 96135, 29928, 83...
## $ date <chr> "2017-04-15", "2017-04-19", "2017-04-14", "2017-04-...
## $ duration <chr> "1316.15 minutes", "8.13 minutes", "24.85 minutes",...
## $ station_A_id <dbl> 67, 21, 16, 58, 16, 6, 5, 16, 5, 81, 30, 16, 16, 67...
## $ station_A_name <chr> "San Francisco Caltrain Station 2 (Townsend St at ...
## $ station_B_id <dbl> 89, 64, 355, 368, 81, 66, 350, 91, 62, 81, 109, 10,...
## $ station_B_name <chr> "Division St at Potrero Ave", "5th St at Brannan St...
## $ bike_id <dbl> 1974, 860, 2263, 1417, 507, 75, 388, 239, 1449, 328...
## $ user_gender <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Ma...
## $ user_birth_year <dbl> 1972, 1986, 1993, 1981, 1981, 1988, 1993, 1996, 199...
# Summary of user_birth_year
summary(bike_share_rides$user_birth_year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1900 1979 1986 1984 1991 2001
# Convert user_birth_year to factor: user_birth_year_fct
bike_share_rides <- bike_share_rides %>%
mutate(user_birth_year_fct = as.factor(user_birth_year))
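# Assert user_birth_year_fct is a factor
assert_is_factor(bike_share_rides$user_birth_year_fct)
# Summary of user_birth_year_fct
summary(bike_share_rides$user_birth_year_fct)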
## 1900 1902 1923 1931 1938 1939 1941 1942 1943 1945 1946 1947 1948 1949 1950 1951
## 1 7 2 23 2 1 3 10 4 16 5 24 9 30 37 25
## 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967
## 70 49 65 66 112 62 156 99 196 161 256 237 245 349 225 363
## 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983
## 365 331 370 548 529 527 563 601 481 541 775 876 825 1016 1056 1262
## 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999
## 1157 1318 1606 1672 2135 1872 2062 1582 1703 1498 1476 1185 813 358 365 348
## 2000 2001
## 473 30
Dapper data type dexterity! Looking at the new summary statistics, more riders were born in 1988 than any other year.
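As a small illustration of why the type matters, the sketch below compares summary() on a numeric vector and on its factor version. The birth years are made up for the example and are not taken from bike_share_rides.

```r
# Hypothetical birth years, for illustration only
birth_year <- c(1988, 1988, 1990, 1975, 1988, 1990)

# On a numeric vector, summary() reports quantiles and the mean
num_summary <- summary(birth_year)

# On a factor, summary() reports a count per level
fct_summary <- summary(as.factor(birth_year))

print(num_summary)
print(fct_summary)
```

The numeric summary treats the years as quantities, while the factor summary answers the more useful question of how many riders were born in each year.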
Trimming strings
In the previous exercise, you were able to identify the correct data type and convert user_birth_year to the correct type, allowing you to extract counts that gave you a bit more insight into the dataset.
Another common dirty data problem is having extra bits like percent signs or periods in numbers, causing them to be read in as characters. In order to be able to crunch these numbers, the extra bits need to be removed and the numbers need to be converted from character to numeric. In this exercise, you’ll need to convert the duration column from character to numeric, but before this can happen, the word "minutes" needs to be removed from each value.
dplyr, assertive, and stringr are loaded and bike_share_rides is available.
bike_share_rides <- bike_share_rides %>%
# Remove 'minutes' from duration: duration_trimmed
mutate(duration_trimmed = str_remove(duration, "minutes"),
# Convert duration_trimmed to numeric: duration_mins
duration_mins = as.numeric(duration_trimmed))
# Glimpse at bike_share_rides
glimpse(bike_share_rides)
## Rows: 35,229
## Columns: 13
## $ ride_id <int> 52797, 54540, 87695, 45619, 70832, 96135, 29928...
## $ date <chr> "2017-04-15", "2017-04-19", "2017-04-14", "2017...
## $ duration <chr> "1316.15 minutes", "8.13 minutes", "24.85 minut...
## $ station_A_id <dbl> 67, 21, 16, 58, 16, 6, 5, 16, 5, 81, 30, 16, 16...
## $ station_A_name <chr> "San Francisco Caltrain Station 2 (Townsend St...
## $ station_B_id <dbl> 89, 64, 355, 368, 81, 66, 350, 91, 62, 81, 109,...
## $ station_B_name <chr> "Division St at Potrero Ave", "5th St at Branna...
## $ bike_id <dbl> 1974, 860, 2263, 1417, 507, 75, 388, 239, 1449,...
## $ user_gender <chr> "Male", "Male", "Male", "Male", "Male", "Male",...
## $ user_birth_year <dbl> 1972, 1986, 1993, 1981, 1981, 1988, 1993, 1996,...
## $ user_birth_year_fct <fct> 1972, 1986, 1993, 1981, 1981, 1988, 1993, 1996,...
## $ duration_trimmed <chr> "1316.15 ", "8.13 ", "24.85 ", "6.35 ", "9.8 ",...
## $ duration_mins <dbl> 1316.15, 8.13, 24.85, 6.35, 9.80, 17.47, 16.52,...
# Assert duration_mins is numeric
assert_is_numeric(bike_share_rides$duration_mins)
# Calculate mean duration
mean(bike_share_rides$duration_mins)
## [1] 13.06214
Great work! By removing characters and converting to a numeric type, you were able to figure out that the average ride duration is about 13 minutes - not bad for a city like San Francisco!
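The same trim-then-convert idea can be tried on a tiny vector. This is a minimal sketch with made-up duration strings. It adds str_trim() to drop the space left behind by str_remove(), which is optional here because as.numeric() tolerates surrounding whitespace.

```r
library(stringr)

# Hypothetical duration strings, for illustration only
duration <- c("8.13 minutes", "24.85 minutes", "6.35 minutes")

# Remove the unit, trim the leftover space, then convert to numeric
duration_trimmed <- str_trim(str_remove(duration, "minutes"))
duration_mins <- as.numeric(duration_trimmed)

print(duration_mins)
print(mean(duration_mins))
```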
What’s an out of range value?
Handling out of range values
Ride duration constraints
Values that are out of range can throw off an analysis, so it’s important to catch them early on. In this exercise, you’ll be examining the duration_mins column more closely. Bikes are not allowed to be kept out for more than 24 hours, or 1440 minutes, at a time, but issues with some of the bikes caused inaccurate recording of the time they were returned.
In this exercise, you’ll replace erroneous data with the range limit (1440 minutes); however, you could just as easily replace these values with NAs.
dplyr, assertive, and ggplot2 are loaded and bike_share_rides is available.
# Create breaks
breaks <- c(min(bike_share_rides$duration_mins), 0, 1440, max(bike_share_rides$duration_mins))
# Create a histogram of duration_mins
ggplot(bike_share_rides, aes(duration_mins)) +
  geom_histogram(breaks = breaks)
# duration_min_const: replace vals of duration_mins > 1440 with 1440
bike_share_rides <- bike_share_rides %>%
mutate(duration_min_const = replace(duration_mins, duration_mins > 1440, 1440))
# Make sure all values of duration_min_const are between 0 and 1440
assert_all_are_in_closed_range(bike_share_rides$duration_min_const, lower = 0, upper = 1440)
Radical replacing! The method of replacing erroneous data with the range limit works well, but you could just as easily replace these values with NAs or something else instead.
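To make the two options concrete, here is a minimal sketch on a made-up vector of durations: one version caps values at the limit, the other marks them as missing. The numbers are hypothetical.

```r
# Hypothetical durations in minutes, for illustration only
duration_mins <- c(12.5, 2000, 45, 1440, 3100)

# Option 1: cap out-of-range values at the range limit
capped <- replace(duration_mins, duration_mins > 1440, 1440)

# Option 2: treat out-of-range values as missing
as_missing <- replace(duration_mins, duration_mins > 1440, NA)

print(capped)
print(as_missing)

# With NAs, summary functions need na.rm = TRUE
print(mean(as_missing, na.rm = TRUE))
```

Capping keeps every row but pulls the mean toward the limit, whereas NA drops the suspect values from summaries entirely, so the right choice depends on how much you trust the capped value.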
Back to the future
Something has gone wrong and it looks like you have data with dates from the future, which is way outside of the date range you expected to be working with. To fix this, you’ll need to remove any rides from the dataset that have a date in the future. Before you can do this, the date column needs to be converted from a character to a Date. Having these as Date objects will make it much easier to figure out which rides are from the future, since R makes it easy to check if one Date object is before (<) or after (>) another.
dplyr and assertive are loaded and bike_share_rides is available. The today() function used below comes from lubridate.
# Convert date to Date type
bike_share_rides <- bike_share_rides %>%
mutate(date = as.Date(date))
# Make sure all dates are in the past
assert_all_are_in_past(bike_share_rides$date)
## Warning: Coercing bike_share_rides$date to class 'POSIXct'.
# Filter for rides that occurred before or on today's date
bike_share_rides_past <- bike_share_rides %>%
filter(date <= today())
# Make sure all dates from bike_share_rides_past are in the past
assert_all_are_in_past(bike_share_rides_past$date)
## Warning: Coercing bike_share_rides_past$date to class 'POSIXct'.
Fabulous filtering! Handling data from the future like this is much easier than trying to verify the data’s correctness by time traveling.
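A minimal base R sketch of the same comparison, using made-up dates, might look like this. It uses Sys.Date(), the base R counterpart of lubridate’s today().

```r
# Hypothetical dates stored as character, for illustration only
date_chr <- c("2017-04-15", "2017-04-19", "2999-01-01")

# Convert to Date so that comparisons are chronological, not alphabetical
ride_date <- as.Date(date_chr)

# Date objects can be compared directly with <, <=, > and >=
in_past <- ride_date <= Sys.Date()

print(ride_date[in_past])
```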
Full duplicates
You’ve been notified that an update has been made to the bike sharing data pipeline to make it more efficient, but that duplicates are more likely to be generated as a result. To make sure that you can continue using the same scripts to run your weekly analyses about ride statistics, you’ll need to ensure that any duplicates in the dataset are removed first.
When multiple rows of a data frame share the same values for all columns, they’re full duplicates of each other. Removing duplicates like this is important, since having the same value repeated multiple times can alter summary statistics like the mean and median. Each ride, including its ride_id, should be unique.
dplyr is loaded and bike_share_rides is available.
# Count the number of full duplicates
sum(duplicated(bike_share_rides))
## [1] 0
# Remove duplicates
bike_share_rides_unique <- distinct(bike_share_rides)
# Count the full duplicates in bike_share_rides_unique
sum(duplicated(bike_share_rides_unique))
## [1] 0
Dazzling duplicate removal! Removing full duplicates will ensure that summary statistics aren’t altered by repeated data points.
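Because the real dataset shown here happens to contain no full duplicates, a toy data frame makes the effect easier to see. The values below are invented for the example.

```r
library(dplyr)

# Hypothetical rides; rows 2 and 3 are identical in every column
rides <- data.frame(
  ride_id = c(1, 2, 2, 3),
  duration_mins = c(10, 25, 25, 7)
)

# duplicated() flags the second and later copies of a row
n_dup <- sum(duplicated(rides))

# distinct() keeps one copy of each fully duplicated row
rides_unique <- distinct(rides)

print(n_dup)
print(rides_unique)
```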
Removing partial duplicates
Now that you’ve identified and removed the full duplicates, it’s time to check for partial duplicates. Partial duplicates are a bit trickier to deal with than full duplicates. In this exercise, you’ll first identify any partial duplicates and then practice the most common technique to deal with them, which involves dropping all partial duplicates, keeping only the first.
dplyr is loaded and bike_share_rides is available.
# Find duplicated ride_ids
bike_share_rides %>%
# Count the number of occurrences of each ride_id
count(ride_id) %>%
# Filter for rows with a count > 1
filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: ride_id <int>, n <int>
# Remove full and partial duplicates
bike_share_rides_unique <- bike_share_rides %>%
# Only based on ride_id instead of all cols
distinct(ride_id, .keep_all = TRUE)
# Find duplicated ride_ids in bike_share_rides_unique
bike_share_rides_unique %>%
# Count the number of occurrences of each ride_id
count(ride_id) %>%
# Filter for rows with a count > 1
filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: ride_id <int>, n <int>
Perfect partial duplicate removing! It’s important to consider the data you’re working with before removing partial duplicates, since sometimes it’s expected that there will be partial duplicates in a dataset, such as if the same customer makes multiple purchases.
Aggregating partial duplicates
Another way of handling partial duplicates is to compute a summary statistic of the values that differ between partial duplicates, such as mean, median, maximum, or minimum. This can come in handy when you’re not sure how your data was collected and want an average, or if based on domain knowledge, you’d rather have too high of an estimate than too low of an estimate (or vice versa).
dplyr is loaded and bike_share_rides is available.
bike_share_rides %>%
# Group by ride_id and date
group_by(ride_id, date) %>%
# Add duration_min_avg column
mutate(duration_min_avg = mean(duration_mins)) %>%
# Remove duplicates based on ride_id and date, keep all cols
distinct(ride_id, date, .keep_all = TRUE) %>%
# Remove duration_mins column
select(-duration_mins)
## # A tibble: 35,229 x 14
## # Groups: ride_id, date [35,229]
## ride_id date duration station_A_id station_A_name station_B_id
## <int> <date> <chr> <dbl> <chr> <dbl>
## 1 52797 2017-04-15 1316.15~ 67 San Francisco~ 89
## 2 54540 2017-04-19 8.13 mi~ 21 Montgomery St~ 64
## 3 87695 2017-04-14 24.85 m~ 16 Steuart St at~ 355
## 4 45619 2017-04-03 6.35 mi~ 58 Market St at ~ 368
## 5 70832 2017-04-10 9.8 min~ 16 Steuart St at~ 81
## 6 96135 2017-04-18 17.47 m~ 6 The Embarcade~ 66
## 7 29928 2017-04-22 16.52 m~ 5 Powell St BAR~ 350
## 8 83331 2017-04-11 14.72 m~ 16 Steuart St at~ 91
## 9 72424 2017-04-05 4.12 mi~ 5 Powell St BAR~ 62
## 10 25910 2017-04-20 25.77 m~ 81 Berry St at 4~ 81
## # ... with 35,219 more rows, and 8 more variables: station_B_name <chr>,
## # bike_id <dbl>, user_gender <chr>, user_birth_year <dbl>,
## # user_birth_year_fct <fct>, duration_trimmed <chr>,
## # duration_min_const <dbl>, duration_min_avg <dbl>
Awesome aggregation! Aggregation of partial duplicates allows you to keep some information about all data points instead of keeping information about just one data point.
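The same group, average, and de-duplicate pattern can be sketched on a toy data frame in which one ride was recorded twice with different durations. All values here are hypothetical.

```r
library(dplyr)

# Hypothetical rides; ride_id 2 appears twice with different durations
rides <- data.frame(
  ride_id = c(1, 2, 2, 3),
  date = as.Date(c("2017-04-01", "2017-04-02", "2017-04-02", "2017-04-03")),
  duration_mins = c(10, 20, 30, 7)
)

rides_avg <- rides %>%
  # Group partial duplicates together
  group_by(ride_id, date) %>%
  # Average the values that differ within each group
  mutate(duration_min_avg = mean(duration_mins)) %>%
  # Keep one row per ride_id and date
  distinct(ride_id, date, .keep_all = TRUE) %>%
  # Drop the original, now ambiguous, column
  select(-duration_mins) %>%
  ungroup()

print(rides_avg)
```

Swapping mean() for max() or min() gives the deliberately high or low estimate mentioned above.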
Not a member
Now that you’ve practiced identifying membership constraint problems, it’s time to fix these problems in a new dataset. Throughout this chapter, you’ll be working with a dataset called sfo_survey, containing survey responses from passengers taking flights from San Francisco International Airport (SFO). Participants were asked questions about the airport’s cleanliness, wait times, safety, and their overall satisfaction.
There were a few issues during data collection that resulted in some inconsistencies in the dataset. In this exercise, you’ll be working with the dest_size column, which categorizes the size of the destination airport that the passengers were flying to. A data frame called dest_sizes is available that contains all the possible destination sizes. Your mission is to find rows with invalid dest_sizes and remove them from the data frame.
dplyr has been loaded and sfo_survey and dest_sizes are available.
sfo_survey <- readRDS("_data/sfo_survey_ch2_1.rds")
dest_size <- c("Small", "Medium", "Large", "Hub")
passengers_per_day <- c("0-20K", "20K-70K", "70K-100K", "100K+")
dest_sizes <- data.frame(cbind(dest_size, passengers_per_day))
dest_sizes$passengers_per_day <- as.factor(dest_sizes$passengers_per_day)
# Count the number of occurrences of dest_size
sfo_survey %>%
  count(dest_size)
## dest_size n
## 1 Small 1
## 2 Hub 1
## 3 Hub 1756
## 4 Large 143
## 5 Large 1
## 6 Medium 682
## 7 Small 225
# Find bad dest_size rows
sfo_survey %>%
# Join with dest_sizes data frame to get bad dest_size rows
anti_join(dest_sizes) %>%
# Select id, airline, destination, and dest_size cols
select(id, airline, destination, dest_size)
## Joining, by = "dest_size"
## id airline destination dest_size
# Remove bad dest_size rows
sfo_survey %>%
# Join with dest_sizes
semi_join(dest_sizes) %>%
# Count the number of each dest_size
count(dest_size)
## Joining, by = "dest_size"
## dest_size n
## 1 Hub 1756
## 2 Large 143
## 3 Medium 682
## 4 Small 225
Great joining! Anti-joins can help you identify the rows that are causing issues, and semi-joins can remove the issue-causing rows. In the next lesson, you’ll learn about other ways to deal with bad values so that you don’t have to lose rows of data.
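The two filtering joins can be seen side by side on a toy example. The survey rows and the invalid value "huge" below are invented for illustration.

```r
library(dplyr)

# Hypothetical survey responses; "huge" is not a valid category
survey <- data.frame(
  id = 1:4,
  dest_size = c("Hub", "Small", "huge", "Medium"),
  stringsAsFactors = FALSE
)

# The set of allowed categories
valid_sizes <- data.frame(
  dest_size = c("Small", "Medium", "Large", "Hub"),
  stringsAsFactors = FALSE
)

# anti_join(): rows of survey with NO match in valid_sizes
bad_rows <- anti_join(survey, valid_sizes, by = "dest_size")

# semi_join(): rows of survey WITH a match in valid_sizes
good_rows <- semi_join(survey, valid_sizes, by = "dest_size")

print(bad_rows)
print(good_rows)
```

Neither join adds columns from valid_sizes; both only filter the rows of survey.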
Identifying inconsistency
In the video exercise, you learned about different kinds of inconsistencies that can occur within categories, making it look like a variable has more categories than it should.
In this exercise, you’ll continue working with the sfo_survey dataset. You’ll examine the dest_size column again as well as the cleanliness column and determine what kind of issues, if any, these two categorical variables face.
dplyr is loaded and sfo_survey is available.
# Count dest_size
sfo_survey %>%
  count(dest_size)
## dest_size n
## 1 Small 1
## 2 Hub 1
## 3 Hub 1756
## 4 Large 143
## 5 Large 1
## 6 Medium 682
## 7 Small 225
# Count cleanliness
sfo_survey %>%
  count(cleanliness)
## cleanliness n
## 1 Average 433
## 2 Clean 970
## 3 Dirty 2
## 4 Somewhat clean 1254
## 5 Somewhat dirty 30
## 6 <NA> 120
Correcting inconsistency
Now that you’ve identified that dest_size has whitespace inconsistencies and cleanliness has capitalization inconsistencies, you’ll use the new tools at your disposal to fix the inconsistent values in sfo_survey instead of removing the data points entirely, which could add bias to your dataset if more than 5% of the data points need to be dropped.
dplyr and stringr are loaded and sfo_survey is available.
# Add new columns to sfo_survey
sfo_survey <- sfo_survey %>%
# dest_size_trimmed: dest_size without whitespace
mutate(dest_size_trimmed = str_trim(dest_size),
# cleanliness_lower: cleanliness converted to lowercase
cleanliness_lower = str_to_lower(cleanliness))
# Count values of dest_size_trimmed
sfo_survey %>%
  count(dest_size_trimmed)
## dest_size_trimmed n
## 1 Hub 1757
## 2 Large 144
## 3 Medium 682
## 4 Small 226
# Count values of cleanliness_lower
sfo_survey %>%
  count(cleanliness_lower)
## cleanliness_lower n
## 1 average 433
## 2 clean 970
## 3 dirty 2
## 4 somewhat clean 1254
## 5 somewhat dirty 30
## 6 <NA> 120
Lovely lowercase conversion and terrific trimming! You were able to convert seven-category data into four-category data, which will help your analysis go more smoothly.
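A minimal sketch of both fixes on made-up values shows how the number of distinct categories shrinks. The vectors are hypothetical and are not rows from sfo_survey.

```r
library(stringr)

# Hypothetical categories with stray whitespace and mixed capitalization
dest_size <- c("Hub", "  Hub", "Small  ", "Small")
cleanliness <- c("Clean", "clean", "CLEAN", "Dirty")

# str_trim() removes leading and trailing whitespace
dest_size_trimmed <- str_trim(dest_size)

# str_to_lower() converts everything to lowercase
cleanliness_lower <- str_to_lower(cleanliness)

print(unique(dest_size_trimmed))
print(unique(cleanliness_lower))
```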
Collapsing categories
One of the tablets that participants filled out the sfo_survey on was not properly configured, allowing the response for dest_region to be free text instead of a dropdown menu. This resulted in some inconsistencies in the dest_region variable that you’ll need to correct in this exercise to ensure that the numbers you report to your boss are as accurate as possible.
dplyr and forcats are loaded and sfo_survey is available.
# Count categories of dest_region
sfo_survey %>%
  count(dest_region)
## dest_region n
## 1 Asia 260
## 2 Australia/New Zealand 66
## 3 Canada/Mexico 220
## 4 Central/South America 29
## 5 East US 498
## 6 Europe 401
## 7 Middle East 79
## 8 Midwest US 281
## 9 West US 975
# Categories to map to Europe
europe_categories <- c("EU", "eur", "Europ")
# Add a new col dest_region_collapsed
sfo_survey %>%
# Map all categories in europe_categories to Europe
mutate(dest_region_collapsed = fct_collapse(dest_region,
Europe = europe_categories)) %>%
# Count categories of dest_region_collapsed
count(dest_region_collapsed)
## Warning: Problem with `mutate()` input `dest_region_collapsed`.
## i Unknown levels in `f`: EU, eur, Europ
## i Input `dest_region_collapsed` is `fct_collapse(dest_region, Europe = europe_categories)`.
## Warning: Unknown levels in `f`: EU, eur, Europ
## dest_region_collapsed n
## 1 Asia 260
## 2 Australia/New Zealand 66
## 3 Canada/Mexico 220
## 4 Central/South America 29
## 5 East US 498
## 6 Europe 401
## 7 Middle East 79
## 8 Midwest US 281
## 9 West US 975
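Clean collapsing! With the misspelled Europe entries present, this step reduces the number of categories from 12 to 9. In the output shown here the counts are unchanged and the "Unknown levels" warning appears, because this copy of the data contains no EU, eur, or Europ values. Either way, you can be confident that 401 of the survey participants were heading to Europe.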
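To see fct_collapse() actually merge levels, here is a small sketch on an invented factor that does contain the misspellings. The data are hypothetical.

```r
library(forcats)

# Hypothetical free-text responses, for illustration only
dest_region <- factor(c("Europe", "EU", "eur", "Asia", "Europ", "Asia"))

# Categories to map to Europe
europe_categories <- c("EU", "eur", "Europ")

# Collapse the misspellings into the existing "Europe" level
dest_region_collapsed <- fct_collapse(dest_region, Europe = europe_categories)

print(table(dest_region_collapsed))
```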
Detecting inconsistent text data
You’ve recently received some news that the customer support team wants to ask the SFO survey participants some follow-up questions. However, the auto-dialer that the call center uses isn’t able to parse all of the phone numbers since they’re all in different formats. After some investigation, you found that some phone numbers are written with hyphens (-) and some are written with parentheses. In this exercise, you’ll figure out which phone numbers have these issues so that you know which ones need fixing.
dplyr and stringr are loaded, and sfo_survey is available.
# Filter for rows with "-" in the phone column
sfo_survey %>%
filter(str_detect(phone, "-"))
# Filter for rows with "(" or ")" in the phone column
sfo_survey %>%
filter(str_detect(phone, fixed("(")) | str_detect(phone, fixed(")")))
Delightful detection! Now that you’ve identified the inconsistencies in the phone column, it’s time to remove unnecessary characters to make the follow-up survey go as smoothly as possible.
Replacing and removing
In the last exercise, you saw that the phone column of sfo_survey is plagued with unnecessary parentheses and hyphens. The customer support team has requested that all phone numbers be in the format "123 456 7890". In this exercise, you’ll use your new stringr skills to fulfill this request.
dplyr and stringr are loaded and sfo_survey is available.
# Remove parentheses from phone column
phone_no_parens <- sfo_survey$phone %>%
# Remove "("s
str_remove_all(fixed("(")) %>%
# Remove ")"s
str_remove_all(fixed(")"))
# Add phone_no_parens as column
sfo_survey %>%
mutate(phone_no_parens = phone_no_parens)
# Add phone_no_parens as column
sfo_survey %>%
mutate(phone_no_parens = phone_no_parens,
# Replace all hyphens in phone_no_parens with spaces
phone_clean = str_replace_all(phone_no_parens, "-", " "))
Radical replacing and removing! Now that your phone numbers are all in a single format, the machines in the call center will be able to auto-dial the numbers, making it easier to ask participants follow-up questions.
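The whole clean-up can be checked on a few made-up phone numbers written in different styles. The numbers below are invented.

```r
library(stringr)

# Hypothetical phone numbers in three different formats
phone <- c("(858) 650-1234", "858-650-1234", "858 650 1234")

# fixed() treats "(" and ")" literally instead of as regex syntax
no_open <- str_remove_all(phone, fixed("("))
phone_no_parens <- str_remove_all(no_open, fixed(")"))

# Replace every hyphen with a space
phone_clean <- str_replace_all(phone_no_parens, "-", " ")

print(phone_clean)
```

All three inputs end up in the requested "123 456 7890" layout.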
Invalid phone numbers
The customer support team is grateful for your work so far, but during their first day of calling participants, they ran into some phone numbers that were invalid. In this exercise, you’ll remove any rows with invalid phone numbers so that these faulty numbers don’t keep slowing the team down.
dplyr and stringr are loaded and sfo_survey is available.
# Check out the invalid numbers
sfo_survey %>%
filter(str_length(phone) != 12)
# Remove rows with invalid numbers
sfo_survey %>%
filter(str_length(phone) == 12) %>% nrow()
## id airline destination phone
## 2 3081 COPA PANAMA CITY 925 8846
## [1] 2804
Mission accomplished! Thanks to your savvy string skills, the follow-up survey will be done in no time!
Date uniformity
In this chapter, you work at an asset management company and you’ll be working with the accounts dataset, which contains information about each customer, the amount in their account, and the date their account was opened. Your boss has asked you to calculate some summary statistics about the average value of each account and whether the age of the account is associated with a higher or lower account value. Before you can do this, you need to make sure that the accounts dataset you’ve been given doesn’t contain any uniformity problems. In this exercise, you’ll investigate the date_opened column and clean it up so that all the dates are in the same format.
dplyr and lubridate are loaded and accounts is available.
accounts <- readRDS("_data/ch3_1_accounts.rds")
# Check out the accounts data frame
head(accounts)
## id date_opened total
## 1 A880C79F 2003-10-19 169305
## 2 BE8222DF October 05, 2018 107460
## 3 19F9E113 2008-07-29 15297152
## 4 A2FE52A3 2005-06-09 14897272
## 5 F6DC2C08 2012-03-31 124568
## 6 D2E55799 2007-06-20 13635752
# Define the date formats
formats <- c("%Y-%m-%d", "%B %d, %Y")
# Convert dates to the same format
accounts <- accounts %>%
mutate(date_opened_clean = parse_date_time(date_opened, orders = formats))
## id date_opened total date_opened_clean
## 1 A880C79F 2003-10-19 169305 2003-10-19
## 2 BE8222DF October 05, 2018 107460 2018-10-05
## 3 19F9E113 2008-07-29 15297152 2008-07-29
## 4 A2FE52A3 2005-06-09 14897272 2005-06-09
## 5 F6DC2C08 2012-03-31 124568 2012-03-31
## 6 D2E55799 2007-06-20 13635752 2007-06-20
## 7 53AE87EF December 01, 2017 15375984 2017-12-01
## 8 3E97F253 2019-06-03 14515800 2019-06-03
## 9 4AE79EA1 2011-05-07 23338536 2011-05-07
## 10 2322DFB4 2018-04-07 189524 2018-04-07
## 11 645335B2 2018-11-16 154001 2018-11-16
## 12 D5EB0F00 2001-04-16 174576 2001-04-16
## 13 1EB593F7 2005-04-21 191989 2005-04-21
## 14 DDBA03D9 2006-06-13 9617192 2006-06-13
## 15 40E4A2F4 2009-01-07 180547 2009-01-07
## 16 39132EEA 2012-07-07 15611960 2012-07-07
## 17 387F8E4D January 03, 2011 9402640 2011-01-03
## 18 11C3C3C0 December 24, 2017 180003 2017-12-24
## 19 C2FC91E1 2004-05-21 105722 2004-05-21
## 20 FB8F01C1 2001-09-06 22575072 2001-09-06
## 21 0128D2D0 2005-04-09 19179784 2005-04-09
## 22 BE6E4B3F 2009-10-20 15679976 2009-10-20
## 23 7C6E2ECC 2003-05-16 169814 2003-05-16
## 24 02E63545 2015-10-25 125117 2015-10-25
## 25 4399C98B May 19, 2001 130421 2001-05-19
## 26 98F4CF0F May 27, 2014 14893944 2014-05-27
## 27 247222A6 May 26, 2015 150372 2015-05-26
## 28 420985EE 2008-12-27 123125 2008-12-27
## 29 0E3903BA 2015-11-11 182668 2015-11-11
## 30 64EF994F 2009-02-26 161141 2009-02-26
## 31 CCF84EDB 2008-12-26 136128 2008-12-26
## 32 51C21705 April 22, 2016 16191136 2016-04-22
## 33 C868C6AD January 31, 2000 11733072 2000-01-31
## 34 92C237C6 2005-12-13 11838528 2005-12-13
## 35 9ECEADB2 May 17, 2018 146153 2018-05-17
## 36 DF0AFE50 2004-12-03 15250040 2004-12-03
## 37 5CD605B3 2016-10-19 87921 2016-10-19
## 38 402839E2 September 14, 2019 163416 2019-09-14
## 39 78286CE7 2009-10-05 15049216 2009-10-05
## 40 168E071B 2013-07-11 87826 2013-07-11
## 41 466CCDAA 2002-03-24 14981304 2002-03-24
## 42 8DE1ECB9 2015-10-17 217975 2015-10-17
## 43 E19FE6B5 June 06, 2009 101936 2009-06-06
## 44 1240D39C September 07, 2011 15761824 2011-09-07
## 45 A7BFAA72 2019-11-12 133790 2019-11-12
## 46 C3D24436 May 24, 2002 101584 2002-05-24
## 47 FAD92F0F September 13, 2007 17081064 2007-09-13
## 48 236A1D51 2019-10-01 18486936 2019-10-01
## 49 A6DDDC4C 2000-08-17 67962 2000-08-17
## 50 DDFD0B3D 2001-04-11 15776384 2001-04-11
## 51 D13375E9 November 01, 2005 13944632 2005-11-01
## 52 AC50B796 2016-06-30 16111264 2016-06-30
## 53 290319FD May 27, 2005 170178 2005-05-27
## 54 FC71925A November 02, 2006 186281 2006-11-02
## 55 7B0F3685 2013-05-23 179102 2013-05-23
## 56 BE411172 2017-02-24 17689984 2017-02-24
## 57 58066E39 September 16, 2015 17025632 2015-09-16
## 58 EA7FF83A 2004-11-02 11598704 2004-11-02
## 59 14A2DDB7 2019-03-06 12808952 2019-03-06
## 60 305EEAA8 2018-09-01 14417728 2018-09-01
## 61 8F25E54C November 24, 2008 189126 2008-11-24
## 62 19DD73C6 2002-12-31 14692600 2002-12-31
## 63 ACB8E6AF 2013-07-27 71359 2013-07-27
## 64 91BFCC40 2014-01-10 132859 2014-01-10
## 65 86ACAF81 2011-12-14 24533704 2011-12-14
## 66 77E85C14 November 20, 2009 13868192 2009-11-20
## 67 C5C6B79D 2008-03-01 188424 2008-03-01
## 68 0E5B69F5 2018-05-07 18650632 2018-05-07
## 69 5275B518 2017-11-23 71665 2017-11-23
## 70 17217048 May 25, 2001 20111208 2001-05-25
## 71 E7496A7F 2008-09-27 142669 2008-09-27
## 72 41BBB7B4 February 22, 2005 144229 2005-02-22
## 73 F6C7ABA1 2008-01-07 183440 2008-01-07
## 74 E699DF01 February 17, 2008 199603 2008-02-17
## 75 BACA7378 2005-05-11 204271 2005-05-11
## 76 84A4302F 2003-08-12 19420648 2003-08-12
## 77 F8A78C27 April 05, 2006 41164 2006-04-05
## 78 8BADDF6A December 31, 2010 158203 2010-12-31
## 79 9FB57E68 September 01, 2017 216352 2017-09-01
## 80 5C98E8F5 2014-11-25 103200 2014-11-25
## 81 6BB53C2A December 03, 2016 146394 2016-12-03
## 82 E23F2505 October 15, 2017 121614 2017-10-15
## 83 0C121914 June 21, 2017 227729 2017-06-21
## 84 3627E08A 2008-04-01 238104 2008-04-01
## 85 A94493B3 August 01, 2009 85975 2009-08-01
## 86 0682E9DE 2002-10-01 72832 2002-10-01
## 87 49931170 2011-03-25 14519856 2011-03-25
## 88 A154F63B 2000-07-11 133800 2000-07-11
## 89 3690CCED 2014-10-19 226595 2014-10-19
## 90 48F5E6D8 February 16, 2020 135435 2020-02-16
## 91 515FAD84 2013-06-20 98190 2013-06-20
## 92 59794264 2008-01-16 157964 2008-01-16
## 93 2038185B 2016-06-24 194662 2016-06-24
## 94 65EAC615 February 20, 2004 140191 2004-02-20
## 95 6C7509C9 September 16, 2000 212089 2000-09-16
## 96 BD969A9D 2007-04-29 167238 2007-04-29
## 97 B0CDCE3D May 28, 2014 145240 2014-05-28
## 98 33A7F03E October 14, 2007 191839 2007-10-14
Cunning calendar cleaning! Now that the date_opened dates are in the same format, you’ll be able to use them for some plotting in the next exercise.
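A short sketch of the mixed-format parsing on three made-up strings is shown below. It assumes an English locale, since the month name "October" has to be recognized. parse_date_time() returns date-times, so as.Date() is used to show plain dates.

```r
library(lubridate)

# Hypothetical dates in two different formats
date_opened <- c("2003-10-19", "October 05, 2018", "2008-07-29")

# Candidate formats, tried for each element
formats <- c("%Y-%m-%d", "%B %d, %Y")

# Parse every string into a single date-time representation
date_opened_clean <- parse_date_time(date_opened, orders = formats)

print(as.Date(date_opened_clean))
```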
Currency uniformity
Now that your dates are in order, you’ll need to correct any unit differences. When you first plot the data, you’ll notice that there’s a group of very high values, and a group of relatively lower values. The bank has two different offices, one in New York and one in Tokyo, so you suspect that the accounts managed by the Tokyo office are in Japanese yen instead of U.S. dollars. Luckily, you have a data frame called account_offices that indicates which office manages each customer’s account, so you can use this information to figure out which totals need to be converted from yen to dollars.
The formula to convert yen to dollars is USD = JPY / 104.
dplyr and ggplot2 are loaded and the accounts and account_offices data frames are available.
office <- as.character(factor(c(1,1,2,2,1,2,2,2,2,1,1,1,1,2,1,2,2,1,1,2,2,2,1,1,1,2,1,1,1,1,1,2,2,2,1,2,1,1,2,1,2,1,1,2,1,1,2,2,1,2,2,2,1,1,1,2, 2,2,2,2,1,2,1,1,2,2,1,2,1,2,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,2,1,1,1,1,1,1,1,1,1,1,1), levels = c(1, 2), labels = c("New York", "Tokyo")))
account_offices <- data.frame(id = accounts$id, office)
# Scatter plot of opening date and total amount
accounts %>%
ggplot(aes(x = date_opened_clean, y = total)) +
  geom_point()
# Left join accounts and account_offices by id
accounts %>%
left_join(account_offices, by = "id") %>%
# Convert totals from the Tokyo office to USD
mutate(total_usd = ifelse(office == "Tokyo", total/104, total)) %>%
# Scatter plot of opening date vs total_usd
ggplot(aes(x = date_opened_clean, y = total_usd)) +
  geom_point()
Crafty currency conversion! The points in your last scatter plot all fall within a much smaller range now and you’ll be able to accurately assess the differences between accounts from different countries.
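The conditional conversion itself is easy to verify on a toy data frame. The accounts and totals below are invented, and the fixed rate of 104 yen per dollar is the one given above.

```r
# Hypothetical accounts, for illustration only
accounts_demo <- data.frame(
  id = c("A", "B", "C"),
  office = c("New York", "Tokyo", "New York"),
  total = c(1000, 104000, 2500),
  stringsAsFactors = FALSE
)

# Convert only the Tokyo totals from yen to dollars
accounts_demo$total_usd <- ifelse(
  accounts_demo$office == "Tokyo",
  accounts_demo$total / 104,
  accounts_demo$total
)

print(accounts_demo)
```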
Validating totals
In this lesson, you’ll continue to work with the accounts data frame, but this time, you have a bit more information about each account. There are three different funds that account holders can store their money in. In this exercise, you’ll validate whether the total amount in each account is equal to the sum of the amount in fund_A, fund_B, and fund_C. If there are any accounts that don’t match up, you can look into them further to see what went wrong in the bookkeeping that led to inconsistencies.
dplyr is loaded and accounts is available.
# Find invalid totals
accounts %>%
# theoretical_total: sum of the three funds
mutate(theoretical_total = fund_A + fund_B + fund_C) %>%
# Find accounts where total doesn't match theoretical_total
filter(theoretical_total != total)
id date_opened total fund_A fund_B fund_C acct_age theoretical_total
1 D5EB0F00 2001-04-16 130920 69487 48681 56408 19 174576
2 92C237C6 2005-12-13 85362 72556 21739 19537 15 113832
3 0E5B69F5 2018-05-07 134488 88475 44383 46475 2 179333
Great job! By using cross field validation, you’ve been able to detect values that don’t make sense. How you choose to handle these values will depend on the dataset.
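Cross field validation can be sketched on a toy table where one total deliberately disagrees with its parts. The figures are made up.

```r
library(dplyr)

# Hypothetical accounts; account "C" has a total that does not add up
accounts_demo <- data.frame(
  id = c("A", "B", "C"),
  total = c(100, 250, 300),
  fund_A = c(50, 100, 100),
  fund_B = c(30, 100, 100),
  fund_C = c(20, 50, 50),
  stringsAsFactors = FALSE
)

invalid_totals <- accounts_demo %>%
  # Recompute the total from its components
  mutate(theoretical_total = fund_A + fund_B + fund_C) %>%
  # Keep only rows where the recorded total disagrees
  filter(theoretical_total != total)

print(invalid_totals)
```

With non-integer amounts, an exact != comparison can flag harmless floating-point differences, so comparing abs(theoretical_total - total) against a small tolerance may be safer.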
Validating age
Now that you found some inconsistencies in the total amounts, you’re suspicious that there may also be inconsistencies in the acct_age column, and you want to see if these inconsistencies are related. Using the skills you learned from the video exercise, you’ll need to validate the age of each account and see if rows with inconsistent acct_ages are the same ones that had inconsistent totals.
dplyr and lubridate are loaded, and accounts is available.
# Find invalid acct_age
accounts %>%
# theoretical_age: age of acct based on date_opened
mutate(theoretical_age = floor(as.numeric(date_opened %--% today(), "years"))) %>%
# Filter for rows where acct_age is different from theoretical_age
filter(acct_age != theoretical_age)
id date_opened total fund_A fund_B fund_C acct_age theoretical_age
1 11C3C3C0 2017-12-24 180003 84295 31591 64117 1 2
2 EA7FF83A 2004-11-02 111526 86856 19406 5264 15 16
3 3627E08A 2008-04-01 238104 60475 89011 88618 11 12
Vigorous validating! There are three accounts that all have ages off by one year, but none of them are the same as the accounts that had total inconsistencies, so it looks like these two bookkeeping errors may not be related.
Visualizing missing data
Dealing with missing data is one of the most common tasks in data science. There are a variety of types of missingness, as well as a variety of types of solutions to missing data.
You just received a new version of the accounts data frame containing data on the amount held and amount invested for new and existing customers. However, there are rows with missing inv_amount values.
You know for a fact that most customers below 25 do not have investment accounts yet, and suspect it could be driving the missingness. The dplyr and visdat packages have been loaded and accounts is available.
# Visualize the missing values by column
vis_miss(accounts)
accounts %>%
# missing_inv: Is inv_amount missing?
mutate(missing_inv = is.na(inv_amount)) %>%
# Group by missing_inv
group_by(missing_inv) %>%
# Calculate mean age for each missing_inv group
summarize(avg_age = mean(age))
# A tibble: 97 x 8
# Groups: missing_inv [2]
cust_id age acct_amount inv_amount account_opened last_transaction
<fct> <int> <dbl> <dbl> <fct> <fct>
1 8C3554~ 54 44245. 35500. 03-05-18 30-09-19
2 D55366~ 36 86507. 81922. 21-01-18 14-01-19
3 A63198~ 49 77799. 46412. 26-01-18 06-10-19
4 93F2F9~ 56 93875. 76563. 21-08-17 10-07-19
5 DE0A08~ 21 99998. NA 05-06-17 15-01-19
6 25E68E~ 47 109738. 93553. 26-12-17 12-11-18
7 3FA929~ 53 79744. 70358. 21-06-18 24-08-18
8 984403~ 29 17940. 14430. 07-10-17 18-05-18
9 870A92~ 58 63523. 51297. 02-09-18 22-02-19
10 166B05~ 53 38175. 15053. 28-02-19 31-10-18
# ... with 87 more rows, and 2 more variables: missing_inv <lgl>, avg_age <dbl>
Since the average age for TRUE missing_inv is 22 and the average age for FALSE missing_inv is 44, it is likely that the inv_amount variable is missing mostly in young customers.
# Sort by age and visualize missing vals
accounts %>%
arrange(age) %>%
vis_miss()
Fabulous visualizations! Investigating summary statistics based on missingness is a great way to determine if data is missing completely at random or missing at random.
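The group comparison above can be sketched in base R as well. This is a minimal illustration on a made-up data frame (accounts_toy and its values are hypothetical, not the course data); it flags missing inv_amount values and compares the mean age of the two groups.

```r
# Toy data (hypothetical values, not the course's accounts data)
accounts_toy <- data.frame(
  age        = c(21, 23, 22, 45, 52, 38),
  inv_amount = c(NA, NA, NA, 42000, 51000, 30000)
)

# Flag rows where inv_amount is missing
accounts_toy$missing_inv <- is.na(accounts_toy$inv_amount)

# Mean age within each missingness group (names are "FALSE" and "TRUE")
avg_age <- tapply(accounts_toy$age, accounts_toy$missing_inv, mean)
print(avg_age)
```

A large gap between the two group means, as in this toy example, suggests that missingness depends on age rather than occurring completely at random.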
Treating missing data
In this exercise, you’re working with another version of the accounts data that contains missing values for both the cust_id and acct_amount columns.
You want to figure out how many unique customers the bank has, as well as the average amount held by customers. You know that rows with missing cust_id don’t really help you, and that the acct_amount is usually about 5 times the inv_amount.
In this exercise, you will drop rows of accounts with missing cust_ids, and impute missing values of acct_amount using this domain knowledge. dplyr and assertive are loaded and accounts is available.
# Create accounts_clean
accounts_clean <- accounts %>%
# Filter to remove rows with missing cust_id
filter(!is.na(cust_id)) %>%
# Add new col acct_amount_filled with replaced NAs
mutate(acct_amount_filled = ifelse(is.na(acct_amount), inv_amount * 5, acct_amount))
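# Assert that cust_id has no missing vals
assert_all_are_not_na(accounts_clean$cust_id)
# Assert that acct_amount_filled has no missing vals
assert_all_are_not_na(accounts_clean$acct_amount_filled)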
Great job! Since your assertions passed, there’s no missing data left, and you can definitely bank on nailing your analysis!
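For readers without dplyr or assertive at hand, the same drop-then-impute logic can be sketched in base R. The data frame below is invented for illustration (accounts_toy and its values are assumptions), and stopifnot() stands in for the assertive checks.

```r
# Toy data (hypothetical values): one missing cust_id, one missing acct_amount
accounts_toy <- data.frame(
  cust_id     = c("A1", NA, "C3", "D4"),
  acct_amount = c(50000, 20000, NA, 10000),
  inv_amount  = c(10000, 4000, 6000, 2000),
  stringsAsFactors = FALSE
)

# Drop rows with a missing cust_id
accounts_clean <- accounts_toy[!is.na(accounts_toy$cust_id), ]

# Impute missing acct_amount as 5 * inv_amount (the domain rule from the text)
accounts_clean$acct_amount_filled <- ifelse(
  is.na(accounts_clean$acct_amount),
  accounts_clean$inv_amount * 5,
  accounts_clean$acct_amount
)

# Base R stand-in for the assertions: fail loudly if any NA remains
stopifnot(!anyNA(accounts_clean$cust_id))
stopifnot(!anyNA(accounts_clean$acct_amount_filled))
print(accounts_clean)
```

Keeping the imputed values in a new column, rather than overwriting acct_amount, leaves the original data available for comparison.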
Types of edit distance
Which is best?
Small distance, small difference
In the video exercise, you learned that there are multiple ways to calculate how similar or different two strings are. Now you’ll practice using the stringdist package to compute string distances using various methods. It’s important to be familiar with different methods, as each one suits some datasets better than others.
The stringdist package has been loaded for you.
# Calculate Damerau-Levenshtein distance
stringdist("las angelos", "los angeles", method = "dl")
## [1] 2
# Calculate LCS distance
stringdist("las angelos", "los angeles", method = "lcs")
## [1] 4
# Calculate Jaccard distance
stringdist("las angelos", "los angeles", method = "jaccard")
## [1] 0
Superb stringdist() skills! In the next exercise, you’ll use Damerau-Levenshtein distance to map typo-ridden cities to their true spellings.
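To see where two of these numbers come from, here is a small base R sketch that does not need the stringdist package. It uses utils::adist(), which computes plain Levenshtein distance (no transpositions, so it is not identical to the "dl" method in general), and a hand-rolled Jaccard distance on single characters, which is an assumption about how the q = 1 case behaves.

```r
a <- "las angelos"
b <- "los angeles"

# Levenshtein distance: insertions, deletions and substitutions only
lev <- drop(utils::adist(a, b))

# Jaccard distance on the sets of single characters used by each string
chars_a <- unique(strsplit(a, "")[[1]])
chars_b <- unique(strsplit(b, "")[[1]])
jaccard <- 1 - length(intersect(chars_a, chars_b)) / length(union(chars_a, chars_b))

print(lev)
print(jaccard)
```

The two strings differ by two substitutions, which explains the distance of 2. The Jaccard distance is 0 because both strings are built from exactly the same set of characters, so this method cannot tell them apart even though they are spelled differently.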
Fixing typos with string distance
In this chapter, one of the datasets you’ll be working with, zagat, is a set of restaurants in New York, Los Angeles, Atlanta, San Francisco, and Las Vegas. The data is from Zagat, a company that collects restaurant reviews, and includes the restaurant names, addresses, phone numbers, as well as other restaurant information.
The city column contains the name of the city that the restaurant is located in. However, there are a number of typos throughout the column. Your task is to map each city to one of the five correctly-spelled cities contained in the cities data frame.
dplyr and fuzzyjoin are loaded, and zagat and cities are available.
zagat <- readRDS("_data/zagat.rds")
cities <- data.frame(city_actual = as.factor(c("new york", "los angeles", "atlanta", "san francisco", "las vegas")))
# Count the number of each city variation
zagat %>%
count(city)
## city n
## 1 atlanta 64
## 2 los angeles 72
## 3 new york 98
## 4 las vegas 26
## 5 san francisco 50
# Join zagat and cities and look at results
zagat %>%
# Left join based on stringdist using city and city_actual cols
stringdist_left_join(cities, by = c("city" = "city_actual")) %>%
# Select the name, city, and city_actual cols
select(name, city, city_actual)
## name city city_actual
## 1 apple pan the los angeles los angeles
## 2 asahi ramen los angeles los angeles
## 3 baja fresh los angeles los angeles
## 4 belvedere the los angeles los angeles
## 5 benita's frites los angeles los angeles
## 6 bernard's los angeles los angeles
## 7 bistro 45 los angeles los angeles
## 8 brighton coffee shop los angeles los angeles
## 9 bristol farms market cafe los angeles los angeles
## 10 cafe'50s los angeles los angeles
## 11 cafe blanc los angeles los angeles
## 12 cassell's los angeles los angeles
## 13 diaghilev los angeles los angeles
## 14 don antonio's los angeles los angeles
## 15 duke's los angeles los angeles
## 16 falafel king los angeles los angeles
## 17 feast from the east los angeles los angeles
## 18 gumbo pot the los angeles los angeles
## 19 indo cafe los angeles los angeles
## 20 jan's family restaurant los angeles los angeles
## 21 jiraffe los angeles los angeles
## 22 jody maroni's sausage kingdom los angeles los angeles
## 23 joe's los angeles los angeles
## 24 john o ` groats los angeles los angeles
## 25 johnny rockets ( la ) los angeles los angeles
## 26 killer shrimp los angeles los angeles
## 27 kokomo cafe los angeles los angeles
## 28 koo koo roo los angeles los angeles
## 29 la salsa ( la ) los angeles los angeles
## 30 langer's los angeles los angeles
## 31 local nochol los angeles los angeles
## 32 mani's bakery & espresso bar los angeles los angeles
## 33 michael's ( los angeles ) los angeles los angeles
## 34 mishima los angeles los angeles
## 35 mo better meatty meat los angeles los angeles
## 36 mulberry st. los angeles los angeles
## 37 ocean park cafe los angeles los angeles
## 38 original pantry bakery los angeles los angeles
## 39 parkway grill los angeles los angeles
## 40 pho hoa los angeles los angeles
## 41 pink's famous chili dogs los angeles los angeles
## 42 r-23 los angeles los angeles
## 43 rae's los angeles los angeles
## 44 rubin's red hots los angeles los angeles
## 45 ruby's ( la ) los angeles los angeles
## 46 ruth's chris steak house ( los angeles ) los angeles los angeles
## 47 shiro los angeles los angeles
## 48 sushi nozawa los angeles los angeles
## 49 sweet lady jane los angeles los angeles
## 50 tommy's los angeles los angeles
## 51 water grill los angeles los angeles
## 52 afghan kebab house new york new york
## 53 arcadia new york new york
## 54 benny's burritos new york new york
## 55 cafe con leche new york new york
## 56 corner bistro new york new york
## 57 cucina della fontana new york new york
## 58 cucina di pesce new york new york
## 59 darbar new york new york
## 60 ej's luncheonette new york new york
## 61 edison cafe new york new york
## 62 elias corner new york new york
## 63 good enough to eat new york new york
## 64 gray's papaya new york new york
## 65 il mulino new york new york
## 66 jackson diner new york new york
## 67 joe's shanghai new york new york
## 68 john's pizzeria new york new york
## 69 kelley & ping new york new york
## 70 kiev new york new york
## 71 kuruma zushi new york new york
## 72 la caridad new york new york
## 73 la grenouille new york new york
## 74 lemongrass grill new york new york
## 75 lombardi's new york new york
## 76 marnie's noodle shop new york new york
## 77 menchanko-tei new york new york
## 78 mitali east-west new york new york
## 79 monsoon ( ny ) new york new york
## 80 moustache new york new york
## 81 nobu new york new york
## 82 one if by land tibs new york new york
## 83 oyster bar new york new york
## 84 palm new york new york
## 85 palm too new york new york
## 86 patsy's pizza new york new york
## 87 peter luger steak house new york new york
## 88 rose of india new york new york
## 89 sam's noodle shop new york new york
## 90 sarabeth's new york new york
## 91 sparks steak house new york new york
## 92 stick to your ribs new york new york
## 93 sushisay new york new york
## 94 sylvia's new york new york
## 95 szechuan hunan cottage new york new york
## 96 szechuan kitchen new york new york
## 97 teresa's new york new york
## 98 thai house cafe new york new york
## 99 thailand restaurant new york new york
## 100 veselka new york new york
## 101 westside cottage new york new york
## 102 windows on the world new york new york
## 103 wollensky's grill new york new york
## 104 yama new york new york
## 105 zarela new york new york
## 106 andre's french restaurant las vegas las vegas
## 107 buccaneer bay club las vegas las vegas
## 108 buzio's in the rio las vegas las vegas
## 109 'em eril's new orleans fish house las vegas las vegas
## 110 fiore rotisserie & grille las vegas las vegas
## 111 hugo's cellar las vegas las vegas
## 112 madame ching's las vegas las vegas
## 113 mayflower cuisinier las vegas las vegas
## 114 michael's ( las vegas ) las vegas las vegas
## 115 monte carlo las vegas las vegas
## 116 moongate las vegas las vegas
## 117 morton's of chicago ( las vegas ) las vegas las vegas
## 118 nicky blair's las vegas las vegas
## 119 piero's restaurant las vegas las vegas
## 120 spago ( las vegas ) las vegas las vegas
## 121 steakhouse the las vegas las vegas
## 122 stefano's las vegas las vegas
## 123 sterling brunch las vegas las vegas
## 124 tre visi las vegas las vegas
## 125 ' 103 west atlanta atlanta
## 126 alon's at the terrace atlanta atlanta
## 127 baker's cajun cafe atlanta atlanta
## 128 barbecue kitchen atlanta atlanta
## 129 bistro the atlanta atlanta
## 130 bobby & june's kountry kitchen atlanta atlanta
## 131 bradshaw's restaurant atlanta atlanta
## 132 brookhaven cafe atlanta atlanta
## 133 cafe sunflower atlanta atlanta
## 134 canoe atlanta atlanta
## 135 carey's atlanta atlanta
## 136 carey's corner atlanta atlanta
## 137 chops atlanta atlanta
## 138 chopstix atlanta atlanta
## 139 deacon burton's soulfood restaurant atlanta atlanta
## 140 eats atlanta atlanta
## 141 flying biscuit the atlanta atlanta
## 142 frijoleros atlanta atlanta
## 143 greenwood's atlanta atlanta
## 144 harold's barbecue atlanta atlanta
## 145 havana sandwich shop atlanta atlanta
## 146 indian delights atlanta atlanta
## 147 java jive atlanta atlanta
## 148 johnny rockets ( at ) atlanta atlanta
## 149 kalo's coffee house atlanta atlanta
## 150 la fonda latina atlanta atlanta
## 151 lettuce souprise you ( at ) atlanta atlanta
## 152 majestic atlanta atlanta
## 153 morton's of chicago ( atlanta ) atlanta atlanta
## 154 my thai atlanta atlanta
## 155 nava atlanta atlanta
## 156 nuevo laredo cantina atlanta atlanta
## 157 original pancake house ( at ) atlanta atlanta
## 158 palm the ( atlanta ) atlanta atlanta
## 159 rainbow restaurant atlanta atlanta
## 160 riviera atlanta atlanta
## 161 silver skillet the atlanta atlanta
## 162 soto atlanta atlanta
## 163 thelma's kitchen atlanta atlanta
## 164 tortillas atlanta atlanta
## 165 van gogh's restaurant & bar atlanta atlanta
## 166 veggieland atlanta atlanta
## 167 white house restaurant atlanta atlanta
## 168 bill's place san francisco san francisco
## 169 cafe flore san francisco san francisco
## 170 caffe greco san francisco san francisco
## 171 campo santo san francisco san francisco
## 172 cha cha cha's san francisco san francisco
## 173 doidge's san francisco san francisco
## 174 dottie's true blue cafe san francisco san francisco
## 175 dusit thai san francisco san francisco
## 176 ebisu san francisco san francisco
## 177 'em erald garden restaurant san francisco san francisco
## 178 eric's chinese restaurant san francisco san francisco
## 179 hamburger mary's san francisco san francisco
## 180 kelly's on trinity san francisco san francisco
## 181 la cumbre san francisco san francisco
## 182 la mediterranee san francisco san francisco
## 183 la taqueria san francisco san francisco
## 184 mario's bohemian cigar store cafe san francisco san francisco
## 185 marnee thai san francisco san francisco
## 186 mel's drive-in san francisco san francisco
## 187 mo's burgers san francisco san francisco
## 188 phnom penh cambodian restaurant san francisco san francisco
## 189 roosevelt tamale parlor san francisco san francisco
## 190 sally's cafe & bakery san francisco san francisco
## 191 san francisco bbq san francisco san francisco
## 192 slanted door san francisco san francisco
## 193 swan oyster depot san francisco san francisco
## 194 thep phanom san francisco san francisco
## 195 ti couz san francisco san francisco
## 196 trio cafe san francisco san francisco
## 197 tu lan san francisco san francisco
## 198 vicolo pizzeria san francisco san francisco
## 199 wa-ha-ka oaxaca mexican grill san francisco san francisco
## 200 arnie morton's of chicago los angeles los angeles
## 201 art's deli los angeles los angeles
## 202 bel-air hotel los angeles los angeles
## 203 campanile los angeles los angeles
## 204 chinois on main los angeles los angeles
## 205 citrus los angeles los angeles
## 206 fenix at the argyle los angeles los angeles
## 207 granita los angeles los angeles
## 208 grill the los angeles los angeles
## 209 l ` orangerie los angeles los angeles
## 210 le chardonnay ( los angeles ) los angeles los angeles
## 211 locanda veneta los angeles los angeles
## 212 matsuhisa los angeles los angeles
## 213 palm the ( los angeles ) los angeles los angeles
## 214 patina los angeles los angeles
## 215 philippe the original los angeles los angeles
## 216 pinot bistro los angeles los angeles
## 217 rex il ristorante los angeles los angeles
## 218 spago ( los angeles ) los angeles los angeles
## 219 valentino los angeles los angeles
## 220 yujean kang's los angeles los angeles
## 221 '21 club new york new york
## 222 aquavit new york new york
## 223 aureole new york new york
## 224 cafe lalo new york new york
## 225 cafe des artistes new york new york
## 226 carmine's new york new york
## 227 carnegie deli new york new york
## 228 chanterelle new york new york
## 229 daniel new york new york
## 230 dawat new york new york
## 231 felidia new york new york
## 232 four seasons new york new york
## 233 gotham bar & grill new york new york
## 234 gramercy tavern new york new york
## 235 island spice new york new york
## 236 jo jo new york new york
## 237 la caravelle new york new york
## 238 la cote basque new york new york
## 239 le bernardin new york new york
## 240 les celebrites new york new york
## 241 lespinasse ( new york city ) new york new york
## 242 lutece new york new york
## 243 manhattan ocean club new york new york
## 244 march new york new york
## 245 mesa grill new york new york
## 246 mi cocina new york new york
## 247 montrachet new york new york
## 248 oceana new york new york
## 249 park avenue cafe ( new york city ) new york new york
## 250 petrossian new york new york
## 251 picholine new york new york
## 252 pisces new york new york
## 253 rainbow room new york new york
## 254 river cafe new york new york
## 255 san domenico new york new york
## 256 second avenue deli new york new york
## 257 seryna new york new york
## 258 shun lee palace new york new york
## 259 sign of the dove new york new york
## 260 smith & wollensky new york new york
## 261 tavern on the green new york new york
## 262 uncle nick's new york new york
## 263 union square cafe new york new york
## 264 virgil's real bbq new york new york
## 265 chin's las vegas las vegas
## 266 coyote cafe ( las vegas ) las vegas las vegas
## 267 le montrachet bistro las vegas las vegas
## 268 palace court las vegas las vegas
## 269 second street grill las vegas las vegas
## 270 steak house the las vegas las vegas
## 271 'till erman the las vegas las vegas
## 272 abruzzi atlanta atlanta
## 273 bacchanalia atlanta atlanta
## 274 bone's restaurant atlanta atlanta
## 275 brasserie le coze atlanta atlanta
## 276 buckhead diner atlanta atlanta
## 277 ciboulette restaurant atlanta atlanta
## 278 delectables atlanta atlanta
## 279 georgia grille atlanta atlanta
## 280 hedgerose heights inn the atlanta atlanta
## 281 heera of india atlanta atlanta
## 282 indigo coastal grill atlanta atlanta
## 283 la grotta atlanta atlanta
## 284 mary mac's tea room atlanta atlanta
## 285 nikolai's roof atlanta atlanta
## 286 pano's & paul 's atlanta atlanta
## 287 ritz-carlton cafe ( buckhead ) atlanta atlanta
## 288 ritz-carlton dining room ( buckhead ) atlanta atlanta
## 289 ritz-carlton restaurant atlanta atlanta
## 290 toulouse atlanta atlanta
## 291 veni vidi vici atlanta atlanta
## 292 alain rondelli san francisco san francisco
## 293 aqua san francisco san francisco
## 294 boulevard san francisco san francisco
## 295 cafe claude san francisco san francisco
## 296 campton place san francisco san francisco
## 297 chez michel san francisco san francisco
## 298 fleur de lys san francisco san francisco
## 299 fringale san francisco san francisco
## 300 hawthorne lane san francisco san francisco
## 301 khan toke thai house san francisco san francisco
## 302 la folie san francisco san francisco
## 303 lulu restaurant-bis-cafe san francisco san francisco
## 304 masa's san francisco san francisco
## 305 mifune san francisco san francisco
## 306 plumpjack cafe san francisco san francisco
## 307 postrio san francisco san francisco
## 308 ritz-carlton dining room ( san francisco ) san francisco san francisco
## 309 rose pistola san francisco san francisco
## 310 ritz-carlton cafe ( atlanta ) atlanta atlanta
Fabulous fixing! Now that you’ve created consistent spelling for each city, it will be much easier to compute summary statistics by city.
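The idea behind the fuzzy join can be sketched in base R: compute the distance from every observed spelling to every reference city and keep the closest one if it is within a cutoff. The misspellings below are hypothetical, the cutoff of 2 is an assumption, and utils::adist() uses plain Levenshtein distance, so this is an illustration of the principle rather than a reproduction of stringdist_left_join() (which returns every match within the cutoff, not only the nearest).

```r
cities <- c("new york", "los angeles", "atlanta", "san francisco", "las vegas")

# Hypothetical observed values, including one city that is not in the list
observed <- c("las angelos", "new yrok", "atlenta",
              "san fransisco", "las vegas", "chicago")

# Distance matrix: one row per observed value, one column per reference city
d <- utils::adist(observed, cities)

# Index and distance of the nearest reference city for each observed value
nearest <- apply(d, 1, which.min)
best    <- d[cbind(seq_along(observed), nearest)]

# Accept the nearest city only if it is within the assumed cutoff
max_dist    <- 2
city_actual <- ifelse(best <= max_dist, cities[nearest], NA_character_)

print(data.frame(city = observed, city_actual = city_actual))
```

Values with no reference city within the cutoff, such as "chicago" here, come back as NA, which mirrors how unmatched rows behave in a left join.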
LCS = longest common subsequence
method/functions: lcs(), jaccard(), jaro_winkler()
Pair blocking
Zagat and Fodor’s are both companies that gather restaurant reviews. The zagat and fodors datasets both contain information about various restaurants, including addresses, phone numbers, and cuisine types. Some restaurants appear in both datasets, but don’t necessarily have the same exact name or phone number written down. In this chapter, you’ll work towards figuring out which restaurants appear in both datasets.
The first step towards this goal is to generate pairs of records so that you can compare them. In this exercise, you’ll first generate all possible pairs, and then use your newly-cleaned city column as a blocking variable.
zagat and fodors are available.
fodors <- readRDS("_data/fodors.rds")
# Generate all possible pairs
pair_blocking(zagat, fodors)
## Simple blocking
## No blocking used.
## First data set: 310 records
## Second data set: 533 records
## Total number of pairs: 165 230 pairs
## ldat with 165 230 rows and 2 columns
## x y
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 1
## 6 6 1
## 7 7 1
## 8 8 1
## 9 9 1
## 10 10 1
## : : :
## 165221 301 533
## 165222 302 533
## 165223 303 533
## 165224 304 533
## 165225 305 533
## 165226 306 533
## 165227 307 533
## 165228 308 533
## 165229 309 533
## 165230 310 533
# Generate pairs with same city
pair_blocking(zagat, fodors, blocking_var = "city")
## Simple blocking
## Blocking variable(s): city
## First data set: 310 records
## Second data set: 533 records
## Total number of pairs: 40 532 pairs
## ldat with 40 532 rows and 2 columns
## x y
## 1 1 1
## 2 1 2
## 3 1 3
## 4 1 4
## 5 1 5
## 6 1 6
## 7 1 7
## 8 1 8
## 9 1 9
## 10 1 10
## : : :
## 40523 310 414
## 40524 310 415
## 40525 310 416
## 40526 310 417
## 40527 310 418
## 40528 310 419
## 40529 310 420
## 40530 310 421
## 40531 310 422
## 40532 310 423
Perfect pairings! By using city as a blocking variable, you were able to reduce the number of pairs you’ll need to compare from 165,230 pairs to 40,532.
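The effect of blocking can be sketched in base R without reclin. The two tiny data frames below are made up for illustration (zagat_toy and fodors_toy are hypothetical); the sketch builds every possible pair and then keeps only the pairs whose city values agree.

```r
# Toy data (hypothetical records)
zagat_toy <- data.frame(
  id   = 1:4,
  city = c("new york", "new york", "atlanta", "las vegas"),
  stringsAsFactors = FALSE
)
fodors_toy <- data.frame(
  id   = 1:5,
  city = c("new york", "atlanta", "atlanta", "las vegas", "los angeles"),
  stringsAsFactors = FALSE
)

# All possible pairs: every row of the first set against every row of the second
all_pairs <- expand.grid(
  x = seq_len(nrow(zagat_toy)),
  y = seq_len(nrow(fodors_toy))
)

# Blocking: keep only the pairs whose city values are identical
same_city     <- zagat_toy$city[all_pairs$x] == fodors_toy$city[all_pairs$y]
blocked_pairs <- all_pairs[same_city, ]

print(nrow(all_pairs))
print(nrow(blocked_pairs))
```

Without blocking the pair count is the product of the two row counts; with blocking it is the sum, over cities, of the products of the per-city row counts. Blocking only works if the blocking variable is clean, which is why the city typos had to be fixed first.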
Comparing pairs
Now that you’ve generated the pairs of restaurants, it’s time to compare them. You can easily customize how you perform your comparisons using the by and default_comparator arguments. There’s no right answer as to what each should be set to, so in this exercise, you’ll try a couple options out.
dplyr and reclin are loaded and zagat and fodors are available.
# Generate pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
# Compare pairs by name using lcs()
compare_pairs(by = "name",
default_comparator = lcs())
## Compare
## By: name
## Simple blocking
## Blocking variable(s): city
## First data set: 310 records
## Second data set: 533 records
## Total number of pairs: 40 532 pairs
## ldat with 40 532 rows and 3 columns
## x y name
## 1 1 1 0.3157895
## 2 1 2 0.3225806
## 3 1 3 0.2307692
## 4 1 4 0.2608696
## 5 1 5 0.4545455
## 6 1 6 0.2142857
## 7 1 7 0.1052632
## 8 1 8 0.2222222
## 9 1 9 0.3000000
## 10 1 10 0.4516129
## : : : :
## 40523 310 414 0.3606557
## 40524 310 415 0.2631579
## 40525 310 416 0.2105263
## 40526 310 417 0.3750000
## 40527 310 418 0.2978723
## 40528 310 419 0.2727273
## 40529 310 420 0.3437500
## 40530 310 421 0.3414634
## 40531 310 422 0.4081633
## 40532 310 423 0.1714286
# Generate pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
# Compare pairs by name, phone, addr
compare_pairs(by = c("name", "phone", "addr"),
default_comparator = jaro_winkler())
## Compare
## By: name, phone, addr
## Simple blocking
## Blocking variable(s): city
## First data set: 310 records
## Second data set: 533 records
## Total number of pairs: 40 532 pairs
## ldat with 40 532 rows and 5 columns
## x y name phone addr
## 1 1 1 0.4871062 0.6746032 0.5703661
## 2 1 2 0.5234025 0.5555556 0.6140351
## 3 1 3 0.4564103 0.7222222 0.5486355
## 4 1 4 0.5102564 0.6746032 0.6842105
## 5 1 5 0.5982906 0.5793651 0.5515351
## 6 1 6 0.3581197 0.6746032 0.4825911
## 7 1 7 0.0000000 0.6269841 0.5457762
## 8 1 8 0.4256410 0.6269841 0.4979621
## 9 1 9 0.5013736 0.7777778 0.6342105
## 10 1 10 0.6011396 0.6746032 0.4654971
## : : : : : :
## 40523 310 414 0.4972291 0.6666667 0.5158263
## 40524 310 415 0.5778143 0.6746032 0.5065359
## 40525 310 416 0.4426564 0.6666667 0.4294118
## 40526 310 417 0.5315404 0.7152778 0.7070387
## 40527 310 418 0.5271102 0.6111111 0.7135914
## 40528 310 419 0.5204981 0.6944444 0.5683007
## 40529 310 420 0.5635103 0.5833333 0.4928843
## 40530 310 421 0.4891899 0.6111111 0.6108883
## 40531 310 422 0.6204433 0.6746032 0.7774510
## 40532 310 423 0.4233716 0.6746032 0.7908497
Crafty comparisons! Choosing a comparator and the columns to compare is highly dataset-dependent, so it’s best to try out different combinations to see which works best on the dataset you’re working with. Next, you’ll build on your string comparison skills and learn about record linkage!
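A comparator turns a pair of strings into a similarity score between 0 and 1. The base R sketch below computes a longest-common-subsequence similarity by dynamic programming. The normalization used here, twice the LCS length divided by the combined string length, is an assumption chosen for illustration and may differ from what reclin's lcs() comparator does internally; lcs_length() and lcs_similarity() are hypothetical helper names.

```r
# Length of the longest common subsequence of two strings
lcs_length <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # dp[i + 1, j + 1] holds the LCS length of the first i chars of a
  # and the first j chars of b; row 1 and column 1 are the empty prefix
  dp <- matrix(0L, nrow = length(x) + 1, ncol = length(y) + 1)
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      dp[i + 1, j + 1] <- if (x[i] == y[j]) {
        dp[i, j] + 1L
      } else {
        max(dp[i, j + 1], dp[i + 1, j])
      }
    }
  }
  dp[length(x) + 1, length(y) + 1]
}

# Similarity in [0, 1]: 1 for identical strings, 0 for no shared characters
lcs_similarity <- function(a, b) {
  total <- nchar(a) + nchar(b)
  if (total == 0) return(1)
  2 * lcs_length(a, b) / total
}

print(lcs_length("las angelos", "los angeles"))
print(lcs_similarity("las angelos", "los angeles"))
```

An LCS length of 9 for these two 11-character strings is consistent with the LCS distance of 4 computed earlier, since that distance counts the characters left out of the common subsequence on both sides (11 + 11 - 2 * 9).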
Putting it together
During this chapter, you’ve cleaned up the city column of zagat using string similarity, as well as generated and compared pairs of restaurants from zagat and fodors. The end is near - all that’s left to do is score and select pairs and link the data together, and you’ll be able to begin your analysis in no time!
reclin and dplyr are loaded and zagat and fodors are available.
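Before running the full pipeline, it may help to see what "selecting pairs" means on a toy example. The base R sketch below takes already-scored candidate pairs and picks matches greedily so that each record is linked at most once. The data, the threshold, and the select_greedy() helper are all hypothetical, and the greedy rule is a simplification that is not claimed to be how select_n_to_m() works internally.

```r
# Toy scored pairs (hypothetical): x and y are row indices, weight is the score
pairs <- data.frame(
  x      = c(1, 1, 2, 2, 3),
  y      = c(1, 2, 1, 2, 2),
  weight = c(0.9, 0.2, 0.3, 0.8, 0.6)
)

# Greedy one-to-one selection: walk through pairs from highest to lowest
# weight and keep a pair only if neither record has been used already
select_greedy <- function(pairs, threshold = 0) {
  ord    <- pairs[order(-pairs$weight), ]
  used_x <- c()
  used_y <- c()
  keep   <- logical(nrow(ord))
  for (k in seq_len(nrow(ord))) {
    free <- !(ord$x[k] %in% used_x) && !(ord$y[k] %in% used_y)
    if (ord$weight[k] > threshold && free) {
      keep[k] <- TRUE
      used_x  <- c(used_x, ord$x[k])
      used_y  <- c(used_y, ord$y[k])
    }
  }
  ord[keep, ]
}

selected <- select_greedy(pairs, threshold = 0.5)
print(selected)
```

Record 3 of the first set ends up unlinked: its best candidate is already taken by a higher-scoring pair. Unlinked records are why the linked output can contain rows with NA on one side.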
# Create pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
# Compare pairs
compare_pairs(by = "name", default_comparator = jaro_winkler()) %>%
# Score pairs
score_problink()
## Compare
## By: name
## Simple blocking
## Blocking variable(s): city
## First data set: 310 records
## Second data set: 533 records
## Total number of pairs: 40 532 pairs
## ldat with 40 532 rows and 4 columns
## x y name weight
## 1 1 1 0.4871062 -0.018054756
## 2 1 2 0.5234025 0.034349215
## 3 1 3 0.4564103 -0.058771317
## 4 1 4 0.5102564 0.014794851
## 5 1 5 0.5982906 0.160497213
## 6 1 6 0.3581197 -0.171215199
## 7 1 7 0.0000000 -0.440170787
## 8 1 8 0.4256410 -0.096683808
## 9 1 9 0.5013736 0.001958745
## 10 1 10 0.6011396 0.165868942
## : : : : :
## 40523 310 414 0.4972291 -0.003930282
## 40524 310 415 0.5778143 0.123235782
## 40525 310 416 0.4426564 -0.076056611
## 40526 310 417 0.5315404 0.046802575
## 40527 310 418 0.5271102 0.039989118
## 40528 310 419 0.5204981 0.029970093
## 40529 310 420 0.5635103 0.098522838
## 40530 310 421 0.4891899 -0.015176894
## 40531 310 422 0.6204433 0.203563939
## 40532 310 423 0.4233716 -0.099374214
# Create pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
# Compare pairs
compare_pairs(by = "name", default_comparator = jaro_winkler()) %>%
# Score pairs
score_problink() %>%
# Select pairs
select_n_to_m()
## Compare
## By: name
## Simple blocking
## Blocking variable(s): city
## First data set: 310 records
## Second data set: 533 records
## Total number of pairs: 40 532 pairs
## ldat with 40 532 rows and 5 columns
## x y name weight select
## 1 1 1 0.4871062 -0.018054756 FALSE
## 2 1 2 0.5234025 0.034349215 FALSE
## 3 1 3 0.4564103 -0.058771317 FALSE
## 4 1 4 0.5102564 0.014794851 FALSE
## 5 1 5 0.5982906 0.160497213 FALSE
## 6 1 6 0.3581197 -0.171215199 FALSE
## 7 1 7 0.0000000 -0.440170787 FALSE
## 8 1 8 0.4256410 -0.096683808 FALSE
## 9 1 9 0.5013736 0.001958745 FALSE
## 10 1 10 0.6011396 0.165868942 FALSE
## : : : : : :
## 40523 310 414 0.4972291 -0.003930282 FALSE
## 40524 310 415 0.5778143 0.123235782 FALSE
## 40525 310 416 0.4426564 -0.076056611 FALSE
## 40526 310 417 0.5315404 0.046802575 FALSE
## 40527 310 418 0.5271102 0.039989118 FALSE
## 40528 310 419 0.5204981 0.029970093 FALSE
## 40529 310 420 0.5635103 0.098522838 FALSE
## 40530 310 421 0.4891899 -0.015176894 FALSE
## 40531 310 422 0.6204433 0.203563939 FALSE
## 40532 310 423 0.4233716 -0.099374214 FALSE
# Create pairs
pair_blocking(zagat, fodors, blocking_var = "city") %>%
# Compare pairs
compare_pairs(by = "name", default_comparator = jaro_winkler()) %>%
# Score pairs
score_problink() %>%
# Select pairs
select_n_to_m() %>%
# Link data
link()
## id.x name.x
## 1 0 apple pan the
## 2 1 asahi ramen
## 3 2 baja fresh
## 4 3 belvedere the
## 5 4 benita's frites
## 6 5 bernard's
## 7 6 bistro 45
## 8 8 brighton coffee shop
## 9 9 bristol farms market cafe
## 10 11 cafe'50s
## 11 12 cafe blanc
## 12 13 cassell's
## 13 15 diaghilev
## 14 16 don antonio's
## 15 17 duke's
## 16 18 falafel king
## 17 19 feast from the east
## 18 20 gumbo pot the
## 19 22 indo cafe
## 20 23 jan's family restaurant
## 21 24 jiraffe
## 22 25 jody maroni's sausage kingdom
## 23 26 joe's
## 24 27 john o ` groats
## 25 30 johnny rockets ( la )
## 26 31 killer shrimp
## 27 32 kokomo cafe
## 28 33 koo koo roo
## 29 35 la salsa ( la )
## 30 37 langer's
## 31 38 local nochol
## 32 40 mani's bakery & espresso bar
## 33 43 michael's ( los angeles )
## 34 44 mishima
## 35 45 mo better meatty meat
## 36 46 mulberry st.
## 37 47 ocean park cafe
## 38 49 original pantry bakery
## 39 50 parkway grill
## 40 51 pho hoa
## 41 52 pink's famous chili dogs
## 42 55 rae's
## 43 56 rubin's red hots
## 44 57 ruby's ( la )
## 45 59 ruth's chris steak house ( los angeles )
## 46 60 shiro
## 47 61 sushi nozawa
## 48 62 sweet lady jane
## 49 64 tommy's
## 50 66 water grill
## 51 68 afghan kebab house
## 52 69 arcadia
## 53 70 benny's burritos
## 54 71 cafe con leche
## 55 72 corner bistro
## 56 73 cucina della fontana
## 57 74 cucina di pesce
## 58 75 darbar
## 59 76 ej's luncheonette
## 60 77 edison cafe
## 61 78 elias corner
## 62 79 good enough to eat
## 63 80 gray's papaya
## 64 81 il mulino
## 65 82 jackson diner
## 66 83 joe's shanghai
## 67 84 john's pizzeria
## 68 85 kelley & ping
## 69 86 kiev
## 70 87 kuruma zushi
## 71 88 la caridad
## 72 89 la grenouille
## 73 90 lemongrass grill
## 74 91 lombardi's
## 75 92 marnie's noodle shop
## 76 93 menchanko-tei
## 77 94 mitali east-west
## 78 95 monsoon ( ny )
## 79 96 moustache
## 80 97 nobu
## 81 98 one if by land tibs
## 82 99 oyster bar
## 83 100 palm
## 84 101 palm too
## 85 102 patsy's pizza
## 86 103 peter luger steak house
## 87 104 rose of india
## 88 105 sam's noodle shop
## 89 106 sarabeth's
## 90 107 sparks steak house
## 91 108 stick to your ribs
## 92 109 sushisay
## 93 110 sylvia's
## 94 111 szechuan hunan cottage
## 95 112 szechuan kitchen
## 96 113 teresa's
## 97 114 thai house cafe
## 98 115 thailand restaurant
## 99 116 veselka
## 100 117 westside cottage
## 101 118 windows on the world
## 102 119 wollensky's grill
## 103 120 yama
## 104 121 zarela
## 105 122 andre's french restaurant
## 106 123 buccaneer bay club
## 107 124 buzio's in the rio
## 108 125 'em eril's new orleans fish house
## 109 126 fiore rotisserie & grille
## 110 127 hugo's cellar
## 111 128 madame ching's
## 112 129 mayflower cuisinier
## 113 130 michael's ( las vegas )
## 114 131 monte carlo
## 115 132 moongate
## 116 133 morton's of chicago ( las vegas )
## 117 134 nicky blair's
## 118 135 piero's restaurant
## 119 136 spago ( las vegas )
## 120 137 steakhouse the
## 121 138 stefano's
## 122 139 sterling brunch
## 123 140 tre visi
## 124 142 alon's at the terrace
## 125 143 baker's cajun cafe
## 126 144 barbecue kitchen
## 127 145 bistro the
## 128 146 bobby & june's kountry kitchen
## 129 147 bradshaw's restaurant
## 130 148 brookhaven cafe
## 131 149 cafe sunflower
## 132 150 canoe
## 133 151 carey's
## 134 152 carey's corner
## 135 153 chops
## 136 154 chopstix
## 137 155 deacon burton's soulfood restaurant
## 138 156 eats
## 139 157 flying biscuit the
## 140 158 frijoleros
## 141 159 greenwood's
## 142 160 harold's barbecue
## 143 161 havana sandwich shop
## 144 163 indian delights
## 145 164 java jive
## 146 165 johnny rockets ( at )
## 147 166 kalo's coffee house
## 148 167 la fonda latina
## 149 168 lettuce souprise you ( at )
## 150 169 majestic
## 151 170 morton's of chicago ( atlanta )
## 152 171 my thai
## 153 172 nava
## 154 173 nuevo laredo cantina
## 155 174 original pancake house ( at )
## 156 175 palm the ( atlanta )
## 157 176 rainbow restaurant
## 158 177 riviera
## 159 178 silver skillet the
## 160 179 soto
## 161 180 thelma's kitchen
## 162 181 tortillas
## 163 182 van gogh's restaurant & bar
## 164 183 veggieland
## 165 184 white house restaurant
## 166 186 bill's place
## 167 187 cafe flore
## 168 188 caffe greco
## 169 189 campo santo
## 170 190 cha cha cha's
## 171 191 doidge's
## 172 192 dottie's true blue cafe
## 173 193 dusit thai
## 174 194 ebisu
## 175 195 'em erald garden restaurant
## 176 196 eric's chinese restaurant
## 177 197 hamburger mary's
## 178 198 kelly's on trinity
## 179 199 la cumbre
## 180 200 la mediterranee
## 181 201 la taqueria
## 182 202 mario's bohemian cigar store cafe
## 183 203 marnee thai
## 184 204 mel's drive-in
## 185 205 mo's burgers
## 186 206 phnom penh cambodian restaurant
## 187 207 roosevelt tamale parlor
## 188 208 sally's cafe & bakery
## 189 209 san francisco bbq
## 190 210 slanted door
## 191 211 swan oyster depot
## 192 212 thep phanom
## 193 213 ti couz
## 194 214 trio cafe
## 195 215 tu lan
## 196 216 vicolo pizzeria
## 197 217 wa-ha-ka oaxaca mexican grill
## 198 218 arnie morton's of chicago
## 199 219 art's deli
## 200 220 bel-air hotel
## 201 222 campanile
## 202 223 chinois on main
## 203 224 citrus
## 204 225 fenix at the argyle
## 205 226 granita
## 206 227 grill the
## 207 229 l ` orangerie
## 208 230 le chardonnay ( los angeles )
## 209 231 locanda veneta
## 210 232 matsuhisa
## 211 233 palm the ( los angeles )
## 212 234 patina
## 213 235 philippe the original
## 214 236 pinot bistro
## 215 237 rex il ristorante
## 216 238 spago ( los angeles )
## 217 239 valentino
## 218 240 yujean kang's
## 219 241 '21 club
## 220 242 aquavit
## 221 243 aureole
## 222 244 cafe lalo
## 223 245 cafe des artistes
## 224 246 carmine's
## 225 247 carnegie deli
## 226 248 chanterelle
## 227 249 daniel
## 228 250 dawat
## 229 251 felidia
## 230 252 four seasons
## 231 253 gotham bar & grill
## 232 254 gramercy tavern
## 233 255 island spice
## 234 256 jo jo
## 235 257 la caravelle
## 236 258 la cote basque
## 237 259 le bernardin
## 238 260 les celebrites
## 239 261 lespinasse ( new york city )
## 240 262 lutece
## 241 263 manhattan ocean club
## 242 264 march
## 243 265 mesa grill
## 244 266 mi cocina
## 245 267 montrachet
## 246 268 oceana
## 247 269 park avenue cafe ( new york city )
## 248 270 petrossian
## 249 271 picholine
## 250 272 pisces
## 251 273 rainbow room
## 252 274 river cafe
## 253 275 san domenico
## 254 276 second avenue deli
## 255 277 seryna
## 256 278 shun lee palace
## 257 279 sign of the dove
## 258 280 smith & wollensky
## 259 281 tavern on the green
## 260 282 uncle nick's
## 261 283 union square cafe
## 262 284 virgil's real bbq
## 263 285 chin's
## 264 286 coyote cafe ( las vegas )
## 265 287 le montrachet bistro
## 266 288 palace court
## 267 289 second street grill
## 268 290 steak house the
## 269 291 'till erman the
## 270 292 abruzzi
## 271 293 bacchanalia
## 272 294 bone's restaurant
## 273 295 brasserie le coze
## 274 296 buckhead diner
## 275 297 ciboulette restaurant
## 276 298 delectables
## 277 299 georgia grille
## 278 300 hedgerose heights inn the
## 279 301 heera of india
## 280 302 indigo coastal grill
## 281 303 la grotta
## 282 304 mary mac's tea room
## 283 305 nikolai's roof
## 284 306 pano's & paul 's
## 285 307 ritz-carlton cafe ( buckhead )
## 286 308 ritz-carlton dining room ( buckhead )
## 287 309 ritz-carlton restaurant
## 288 310 toulouse
## 289 311 veni vidi vici
## 290 312 alain rondelli
## 291 313 aqua
## 292 314 boulevard
## 293 315 cafe claude
## 294 316 campton place
## 295 317 chez michel
## 296 318 fleur de lys
## 297 319 fringale
## 298 320 hawthorne lane
## 299 321 khan toke thai house
## 300 322 la folie
## 301 323 lulu restaurant-bis-cafe
## 302 324 masa's
## 303 325 mifune
## 304 326 plumpjack cafe
## 305 327 postrio
## 306 328 ritz-carlton dining room ( san francisco )
## 307 329 rose pistola
## 308 330 ritz-carlton cafe ( atlanta )
## 309 54 r-23
## 310 141 ' 103 west
## 311 NA <NA>
## 312 NA <NA>
## 313 NA <NA>
## 314 NA <NA>
## 315 NA <NA>
## 316 NA <NA>
## 317 NA <NA>
## 318 NA <NA>
## 319 NA <NA>
## 320 NA <NA>
## 321 NA <NA>
## 322 NA <NA>
## 323 NA <NA>
## 324 NA <NA>
## 325 NA <NA>
## 326 NA <NA>
## 327 NA <NA>
## 328 NA <NA>
## 329 NA <NA>
## 330 NA <NA>
## 331 NA <NA>
## 332 NA <NA>
## 333 NA <NA>
## 334 NA <NA>
## 335 NA <NA>
## 336 NA <NA>
## 337 NA <NA>
## 338 NA <NA>
## 339 NA <NA>
## 340 NA <NA>
## 341 NA <NA>
## 342 NA <NA>
## 343 NA <NA>
## 344 NA <NA>
## 345 NA <NA>
## 346 NA <NA>
## 347 NA <NA>
## 348 NA <NA>
## 349 NA <NA>
## 350 NA <NA>
## 351 NA <NA>
## 352 NA <NA>
## 353 NA <NA>
## 354 NA <NA>
## 355 NA <NA>
## 356 NA <NA>
## 357 NA <NA>
## 358 NA <NA>
## 359 NA <NA>
## 360 NA <NA>
## 361 NA <NA>
## 362 NA <NA>
## 363 NA <NA>
## 364 NA <NA>
## 365 NA <NA>
## 366 NA <NA>
## 367 NA <NA>
## 368 NA <NA>
## 369 NA <NA>
## 370 NA <NA>
## 371 NA <NA>
## 372 NA <NA>
## 373 NA <NA>
## 374 NA <NA>
## 375 NA <NA>
## 376 NA <NA>
## 377 NA <NA>
## 378 NA <NA>
## 379 NA <NA>
## 380 NA <NA>
## 381 NA <NA>
## 382 NA <NA>
## 383 NA <NA>
## 384 NA <NA>
## 385 NA <NA>
## 386 NA <NA>
## 387 NA <NA>
## 388 NA <NA>
## 389 NA <NA>
## 390 NA <NA>
## 391 NA <NA>
## 392 NA <NA>
## 393 NA <NA>
## 394 NA <NA>
## 395 NA <NA>
## 396 NA <NA>
## 397 NA <NA>
## 398 NA <NA>
## 399 NA <NA>
## 400 NA <NA>
## 401 NA <NA>
## 402 NA <NA>
## 403 NA <NA>
## 404 NA <NA>
## 405 NA <NA>
## 406 NA <NA>
## 407 NA <NA>
## 408 NA <NA>
## 409 NA <NA>
## 410 NA <NA>
## 411 NA <NA>
## 412 NA <NA>
## 413 NA <NA>
## 414 NA <NA>
## 415 NA <NA>
## 416 NA <NA>
## 417 NA <NA>
## 418 NA <NA>
## 419 NA <NA>
## 420 NA <NA>
## 421 NA <NA>
## 422 NA <NA>
## 423 NA <NA>
## 424 NA <NA>
## 425 NA <NA>
## 426 NA <NA>
## 427 NA <NA>
## 428 NA <NA>
## 429 NA <NA>
## 430 NA <NA>
## 431 NA <NA>
## 432 NA <NA>
## 433 NA <NA>
## 434 NA <NA>
## 435 NA <NA>
## 436 NA <NA>
## 437 NA <NA>
## 438 NA <NA>
## 439 NA <NA>
## 440 NA <NA>
## 441 NA <NA>
## 442 NA <NA>
## 443 NA <NA>
## 444 NA <NA>
## 445 NA <NA>
## 446 NA <NA>
## 447 NA <NA>
## 448 NA <NA>
## 449 NA <NA>
## 450 NA <NA>
## 451 NA <NA>
## 452 NA <NA>
## 453 NA <NA>
## 454 NA <NA>
## 455 NA <NA>
## 456 NA <NA>
## 457 NA <NA>
## 458 NA <NA>
## 459 NA <NA>
## 460 NA <NA>
## 461 NA <NA>
## 462 NA <NA>
## 463 NA <NA>
## 464 NA <NA>
## 465 NA <NA>
## 466 NA <NA>
## 467 NA <NA>
## 468 NA <NA>
## 469 NA <NA>
## 470 NA <NA>
## 471 NA <NA>
## 472 NA <NA>
## 473 NA <NA>
## 474 NA <NA>
## 475 NA <NA>
## 476 NA <NA>
## 477 NA <NA>
## 478 NA <NA>
## 479 NA <NA>
## 480 NA <NA>
## 481 NA <NA>
## 482 NA <NA>
## 483 NA <NA>
## 484 NA <NA>
## 485 NA <NA>
## 486 NA <NA>
## 487 NA <NA>
## 488 NA <NA>
## 489 NA <NA>
## 490 NA <NA>
## 491 NA <NA>
## 492 NA <NA>
## 493 NA <NA>
## 494 NA <NA>
## 495 NA <NA>
## 496 NA <NA>
## 497 NA <NA>
## 498 NA <NA>
## 499 NA <NA>
## 500 NA <NA>
## 501 NA <NA>
## 502 NA <NA>
## 503 NA <NA>
## 504 NA <NA>
## 505 NA <NA>
## 506 NA <NA>
## 507 NA <NA>
## 508 NA <NA>
## 509 NA <NA>
## 510 NA <NA>
## 511 NA <NA>
## 512 NA <NA>
## 513 NA <NA>
## 514 NA <NA>
## 515 NA <NA>
## 516 NA <NA>
## 517 NA <NA>
## 518 NA <NA>
## 519 NA <NA>
## 520 NA <NA>
## 521 NA <NA>
## 522 NA <NA>
## 523 NA <NA>
## 524 NA <NA>
## 525 NA <NA>
## 526 NA <NA>
## 527 NA <NA>
## 528 NA <NA>
## 529 NA <NA>
## 530 NA <NA>
## 531 NA <NA>
## 532 NA <NA>
## 533 NA <NA>
## 534 NA <NA>
## 535 NA <NA>
## addr.x city.x phone.x
## 1 10801 w. pico blvd. los angeles 310-475-3585
## 2 2027 sawtelle blvd. los angeles 310-479-2231
## 3 3345 kimber dr. los angeles 805-498-4049
## 4 9882 little santa monica blvd. los angeles 310-788-2306
## 5 1433 third st. promenade los angeles 310-458-2889
## 6 515 s. olive st. los angeles 213-612-1580
## 7 45 s. mentor ave. los angeles 818-795-2478
## 8 9600 brighton way los angeles 310-276-7732
## 9 1570 rosecrans ave. s. los angeles 310-643-5229
## 10 838 lincoln blvd. los angeles 310-399-1955
## 11 9777 little santa monica blvd. los angeles 310-888-0108
## 12 3266 w. sixth st. los angeles 213-480-8668
## 13 1020 n. san vicente blvd. los angeles 310-854-1111
## 14 1136 westwood blvd. los angeles 310-209-1422
## 15 8909 sunset blvd. los angeles 310-652-3100
## 16 1059 broxton ave. los angeles 310-208-4444
## 17 1949 westwood blvd. los angeles 310-475-0400
## 18 6333 w. third st. los angeles 213-933-0358
## 19 10428 1/2 national blvd. los angeles 310-815-1290
## 20 8424 beverly blvd. los angeles 213-651-2866
## 21 502 santa monica blvd los angeles 310-917-6671
## 22 2011 ocean front walk los angeles 310-306-1995
## 23 1023 abbot kinney blvd. los angeles 310-399-5811
## 24 10516 w. pico blvd. los angeles 310-204-0692
## 25 7507 melrose ave. los angeles 213-651-3361
## 26 4000 colfax ave. los angeles 818-508-1570
## 27 6333 w. third st. los angeles 213-933-0773
## 28 8393 w. beverly blvd. los angeles 213-655-9045
## 29 22800 pch los angeles 310-456-6299
## 30 704 s. alvarado st. los angeles 213-483-8050
## 31 30869 thousand oaks blvd. los angeles 818-706-7706
## 32 519 s. fairfax ave. los angeles 213-938-8800
## 33 1147 third st. los angeles 310-451-0843
## 34 8474 w. third st. los angeles 213-782-0181
## 35 7261 melrose ave. los angeles 213-935-5280
## 36 17040 ventura blvd. los angeles 818-906-8881
## 37 3117 ocean park blvd. los angeles 310-452-5728
## 38 875 s. figueroa st. downtown los angeles 213-627-6879
## 39 510 s. arroyo pkwy . los angeles 818-795-1001
## 40 642 broadway los angeles 213-626-5530
## 41 709 n. la brea ave. los angeles 213-931-4223
## 42 2901 pico blvd. los angeles 310-828-7937
## 43 15322 ventura blvd. los angeles 818-905-6515
## 44 45 s. fair oaks ave. los angeles 818-796-7829
## 45 224 s. beverly dr. los angeles 310-859-8744
## 46 1505 mission st. s. los angeles 818-799-4774
## 47 11288 ventura blvd. los angeles 818-508-7017
## 48 8360 melrose ave. los angeles 213-653-7145
## 49 2575 beverly blvd. los angeles 213-389-9060
## 50 544 s. grand ave. los angeles 213-891-0900
## 51 764 ninth ave. new york 212-307-1612
## 52 21 e. 62nd st. new york 212-223-2900
## 53 93 ave. a new york 212-254-2054
## 54 424 amsterdam ave. new york 212-595-7000
## 55 331 w. fourth st. new york 212-242-9502
## 56 368 bleecker st. new york 212-242-0636
## 57 87 e. fourth st. new york 212-260-6800
## 58 44 w. 56th st. new york 212-432-7227
## 59 432 sixth ave. new york 212-473-5555
## 60 228 w. 47th st. new york 212-840-5000
## 61 24-02 31st st. new york 718-932-1510
## 62 483 amsterdam ave. new york 212-496-0163
## 63 2090 broadway new york 212-799-0243
## 64 86 w. third st. new york 212-673-3783
## 65 37-03 74th st. new york 718-672-1232
## 66 9 pell st. new york 718-539-3838
## 67 48 w. 65th st. new york 212-721-7001
## 68 127 greene st. new york 212-228-1212
## 69 117 second ave. new york 212-674-4040
## 70 2nd fl . new york 212-317-2802
## 71 2199 broadway new york 212-874-2780
## 72 3 e. 52nd st. new york 212-752-1495
## 73 61a seventh ave. new york 718-399-7100
## 74 32 spring st. new york 212-941-7994
## 75 466 hudson st. new york 212-741-3214
## 76 39 w. 55th st. new york 212-247-1585
## 77 296 bleecker st. new york 212-989-1367
## 78 435 amsterdam ave. new york 212-580-8686
## 79 405 atlantic ave. new york 718-852-5555
## 80 105 hudson st. new york 212-219-0500
## 81 17 barrow st. new york 212-228-0822
## 82 ` lower level new york 212-490-6650
## 83 837 second ave. new york 212-687-2953
## 84 840 second ave. new york 212-697-5198
## 85 19 old fulton st. new york 718-858-4300
## 86 178 broadway new york 718-387-7400
## 87 308 e. sixth st. new york 212-533-5011
## 88 411 third ave. new york 212-213-2288
## 89 1295 madison ave. new york 212-410-7335
## 90 210 e. 46th st. new york 212-687-4855
## 91 5-16 51st ave. new york 718-937-3030
## 92 38 e. 51st st. new york 212-755-1780
## 93 328 lenox ave. new york 212-996-0660
## 94 1588 york ave. new york 212-535-5223
## 95 1460 first ave. new york 212-249-4615
## 96 80 montague st. new york 718-520-2910
## 97 151 hudson st. new york 212-334-1085
## 98 106 bayard st. new york 212-349-3132
## 99 144 second ave. new york 212-228-9682
## 100 689 ninth ave. new york 212-245-0800
## 101 107th fl . new york 212-524-7000
## 102 205 e. 49th st. new york 212-753-0444
## 103 122 e. 17th st. new york 212-475-0969
## 104 953 second ave. new york 212-644-6740
## 105 401 s. 6th st. las vegas 702-385-5016
## 106 3300 las vegas blvd. s. las vegas 702-894-7350
## 107 3700 w. flamingo rd. las vegas 702-252-7697
## 108 3799 las vegas blvd. s. las vegas 702-891-7374
## 109 3700 w. flamingo rd. las vegas 702-252-7702
## 110 202 e. fremont st. las vegas 702-385-4011
## 111 3300 las vegas blvd. s. las vegas 702-894-7111
## 112 4750 w. sahara ave. las vegas 702-870-8432
## 113 3595 las vegas blvd. s. las vegas 702-737-7111
## 114 3145 las vegas blvd. s. las vegas 702-733-4524
## 115 3400 las vegas blvd. s. las vegas 702-791-7352
## 116 3200 las vegas blvd. s. las vegas 702-893-0703
## 117 3925 paradise rd. las vegas 702-792-9900
## 118 355 convention center dr. las vegas 702-369-2305
## 119 3500 las vegas blvd. s. las vegas 702-369-6300
## 120 128 e. fremont st. las vegas 702-382-1600
## 121 129 fremont st. las vegas 702-385-7111
## 122 3645 las vegas blvd. s. las vegas 702-739-4651
## 123 3799 las vegas blvd. s. las vegas 702-891-7331
## 124 659 peachtree st. atlanta 404-724-0444
## 125 1134 euclid ave. atlanta 404-223-5039
## 126 1437 virginia ave. atlanta 404-766-9906
## 127 56 e. andrews dr. nw atlanta 404-231-5733
## 128 375 14th st. atlanta 404-876-3872
## 129 2911 s. pharr court atlanta 404-261-7015
## 130 4274 peachtree rd. atlanta 404-231-5907
## 131 5975 roswell rd. atlanta 404-256-1675
## 132 4199 paces ferry rd. atlanta 770-432-2663
## 133 1021 cobb pkwy . se atlanta 770-422-8042
## 134 1215 powers ferry rd. atlanta 770-933-0909
## 135 70 w. paces ferry rd. atlanta 404-262-2675
## 136 4279 roswell rd. atlanta 404-255-4868
## 137 1029 edgewood ave. se atlanta 404-523-1929
## 138 600 ponce de leon ave. atlanta 404-888-9149
## 139 1655 mclendon ave. atlanta 404-687-8888
## 140 1031 peachtree st. ne atlanta 404-892-8226
## 141 1087 green st. atlanta 770-992-5383
## 142 171 mcdonough blvd. atlanta 404-627-9268
## 143 2905 buford hwy . atlanta 404-636-4094
## 144 3675 satellite blvd. atlanta 100-813-8212
## 145 790 ponce de leon ave. atlanta 404-876-6161
## 146 2970 cobb pkwy . atlanta 770-955-6068
## 147 1248 clairmont rd. atlanta 404-325-3733
## 148 4427 roswell rd. atlanta 404-303-8201
## 149 3525 mall blvd. atlanta 770-418-9969
## 150 1031 ponce de leon ave. atlanta 404-875-0276
## 151 303 peachtree st. ne atlanta 404-577-4366
## 152 1248 clairmont rd. atlanta 404-636-4280
## 153 3060 peachtree rd. atlanta 404-240-1984
## 154 1495 chattahoochee ave. nw atlanta 404-352-9009
## 155 4330 peachtree rd. atlanta 404-237-4116
## 156 3391 peachtree rd. ne atlanta 404-814-1955
## 157 2118 n. decatur rd. atlanta 404-633-3538
## 158 519 e. paces ferry rd. atlanta 404-262-7112
## 159 200 14th st. nw atlanta 404-874-1388
## 160 3330 piedmont rd. atlanta 404-233-2005
## 161 764 marietta st. nw atlanta 404-688-5855
## 162 774 ponce de leon ave. ne atlanta 404-892-0193
## 163 70 w. crossville rd. atlanta 770-993-1156
## 164 220 sandy springs circle atlanta 404-231-3111
## 165 3172 peachtree rd. ne atlanta 404-237-7601
## 166 2315 clement st. san francisco 415-221-5262
## 167 2298 market st. san francisco 415-621-8579
## 168 423 columbus ave. san francisco 415-397-6261
## 169 240 columbus ave. san francisco 415-433-9623
## 170 1805 haight st. san francisco 415-386-5758
## 171 2217 union st. san francisco 415-921-2149
## 172 522 jones st. san francisco 415-885-2767
## 173 3221 mission st. san francisco 415-826-4639
## 174 1283 ninth ave. san francisco 415-566-1770
## 175 1550 california st. san francisco 415-673-1155
## 176 1500 church st. san francisco 415-282-0919
## 177 1582 folsom st. san francisco 415-626-1985
## 178 333 bush st. san francisco 415-362-4454
## 179 515 valencia st. san francisco 415-863-8205
## 180 288 noe st. san francisco 415-431-7210
## 181 2889 mission st. san francisco 415-285-7117
## 182 2209 polk st. san francisco 415-776-8226
## 183 2225 irving st. san francisco 415-665-9500
## 184 3355 geary st. san francisco 415-387-2244
## 185 1322 grant st. san francisco 415-788-3779
## 186 631 larkin st. san francisco 415-775-5979
## 187 2817 24th st. san francisco 415-550-9213
## 188 300 de haro st. san francisco 415-626-6006
## 189 1328 18th st. san francisco 415-431-8956
## 190 584 valencia st. san francisco 415-861-8032
## 191 1517 polk st. san francisco 415-673-1101
## 192 400 waller st. san francisco 415-431-2526
## 193 3108 16th st. san francisco 415-252-7373
## 194 1870 fillmore st. san francisco 415-563-2248
## 195 8 sixth st. san francisco 415-626-0927
## 196 201 ivy st. san francisco 415-863-2382
## 197 2141 polk st. san francisco 415-775-1055
## 198 435 s. la cienega blvd. los angeles 310-246-1501
## 199 12224 ventura blvd. los angeles 818-762-1221
## 200 701 stone canyon rd. los angeles 310-472-1211
## 201 624 s. la brea ave. los angeles 213-938-1447
## 202 2709 main st. los angeles 310-392-9025
## 203 6703 melrose ave. los angeles 213-857-0034
## 204 8358 sunset blvd. los angeles 213-848-6677
## 205 23725 w. malibu rd. los angeles 310-456-0488
## 206 9560 dayton way los angeles 310-276-0615
## 207 903 n. la cienega blvd. los angeles 310-652-9770
## 208 8284 melrose ave. los angeles 213-655-8880
## 209 8638 w. third st. los angeles 310-274-1893
## 210 129 n. la cienega blvd. los angeles 310-659-9639
## 211 9001 santa monica blvd. los angeles 310-550-8811
## 212 5955 melrose ave. los angeles 213-467-1108
## 213 1001 n. alameda st. los angeles 213-628-3781
## 214 12969 ventura blvd. los angeles 818-990-0500
## 215 617 s. olive st. los angeles 213-627-2300
## 216 8795 sunset blvd. los angeles 310-652-4025
## 217 3115 pico blvd. los angeles 310-829-4313
## 218 67 n. raymond ave. los angeles 818-585-0855
## 219 21 w. 52nd st. new york 212-582-7200
## 220 13 w. 54th st. new york 212-307-7311
## 221 34 e. 61st st. new york 212-319-1660
## 222 201 w. 83rd st. new york 212-496-6031
## 223 1 w. 67th st. new york 212-877-3500
## 224 2450 broadway new york 212-362-2200
## 225 854 seventh ave. new york 212-757-2245
## 226 2 harrison st. new york 212-966-6960
## 227 20 e. 76th st. new york 212-288-0033
## 228 210 e. 58th st. new york 212-355-7555
## 229 243 e. 58th st. new york 212-758-1479
## 230 99 e. 52nd st. new york 212-754-9494
## 231 12 e. 12th st. new york 212-620-4020
## 232 42 e. 20th st. new york 212-477-0777
## 233 402 w. 44th st. new york 212-765-1737
## 234 160 e. 64th st. new york 212-223-5656
## 235 33 w. 55th st. new york 212-586-4252
## 236 60 w. 55th st. new york 212-688-6525
## 237 155 w. 51st st. new york 212-489-1515
## 238 155 w. 58th st. new york 212-484-5113
## 239 2 e. 55th st. new york 212-339-6719
## 240 249 e. 50th st. new york 212-752-2225
## 241 57 w. 58th st. new york 212-371-7777
## 242 405 e. 58th st. new york 212-754-6272
## 243 102 fifth ave. new york 212-807-7400
## 244 57 jane st. new york 212-627-8273
## 245 239 w. broadway new york 212-219-2777
## 246 55 e. 54th st. new york 212-759-5941
## 247 100 e. 63rd st. new york 212-644-1900
## 248 182 w. 58th st. new york 212-245-2214
## 249 35 w. 64th st. new york 212-724-8585
## 250 95 ave. a new york 212-260-6660
## 251 30 rockefeller plaza new york 212-632-5000
## 252 1 water st. new york 718-522-5200
## 253 240 central park s. new york 212-265-5959
## 254 156 second ave. new york 212-677-0606
## 255 11 e. 53rd st. new york 212-980-9393
## 256 155 e. 55th st. new york 212-371-8844
## 257 1110 third ave. new york 212-861-8080
## 258 797 third ave. new york 212-753-1530
## 259 ` central park west new york 212-873-3200
## 260 747 ninth ave. new york 212-245-7992
## 261 21 e. 16th st. new york 212-243-4020
## 262 152 w. 44th st. new york 212-921-9494
## 263 3200 las vegas blvd. s. las vegas 702-733-8899
## 264 3799 las vegas blvd. s. las vegas 702-891-7349
## 265 3000 paradise rd. las vegas 702-732-5651
## 266 3570 las vegas blvd. s. las vegas 702-731-7110
## 267 200 e. fremont st. las vegas 702-385-6277
## 268 2880 las vegas blvd. s. las vegas 702-734-0410
## 269 2245 e. flamingo rd. las vegas 702-731-4036
## 270 2355 peachtree rd. ne atlanta 404-261-8186
## 271 3125 piedmont rd. atlanta 404-365-0410
## 272 3130 piedmont rd. ne atlanta 404-237-2663
## 273 3393 peachtree rd. atlanta 404-266-1440
## 274 3073 piedmont rd. atlanta 404-262-3336
## 275 1529 piedmont ave. atlanta 404-874-7600
## 276 1 margaret mitchell sq. atlanta 404-681-2909
## 277 2290 peachtree rd. atlanta 404-352-3517
## 278 490 e. paces ferry rd. ne atlanta 404-233-7673
## 279 595 piedmont ave. atlanta 404-876-4408
## 280 1397 n. highland ave. atlanta 404-876-0676
## 281 2637 peachtree rd. ne atlanta 404-231-1368
## 282 224 ponce de leon ave. atlanta 404-876-1800
## 283 255 courtland st. atlanta 404-221-6362
## 284 1232 w. paces ferry rd. atlanta 404-261-3662
## 285 3434 peachtree rd. ne atlanta 404-237-2700
## 286 3434 peachtree rd. ne atlanta 404-237-2700
## 287 181 peachtree st. atlanta 404-659-0400
## 288 293-b peachtree rd. atlanta 404-351-9533
## 289 41 14th st. atlanta 404-875-8424
## 290 126 clement st. san francisco 415-387-0408
## 291 252 california st. san francisco 415-956-9662
## 292 1 mission st. san francisco 415-543-6084
## 293 7 claude ln . san francisco 415-392-3505
## 294 340 stockton st. san francisco 415-955-5555
## 295 804 north point st. san francisco 415-775-7036
## 296 777 sutter st. san francisco 415-673-7779
## 297 570 fourth st. san francisco 415-543-0573
## 298 22 hawthorne st. san francisco 415-777-9779
## 299 5937 geary blvd. san francisco 415-668-6654
## 300 2316 polk st. san francisco 415-776-5577
## 301 816 folsom st. san francisco 415-495-5775
## 302 648 bush st. san francisco 415-989-7154
## 303 1737 post st. san francisco 415-922-0337
## 304 3127 fillmore st. san francisco 415-563-4755
## 305 545 post st. san francisco 415-776-7825
## 306 600 stockton st. san francisco 415-296-7465
## 307 532 columbus ave. san francisco 415-399-0499
## 308 181 peachtree st. atlanta 404-659-0400
## 309 923 e. third st. los angeles 213-687-7178
## 310 103 w. paces ferry rd. atlanta 404-233-5993
## 311 <NA> <NA> <NA>
## 312 <NA> <NA> <NA>
## 313 <NA> <NA> <NA>
## 314 <NA> <NA> <NA>
## 315 <NA> <NA> <NA>
## 316 <NA> <NA> <NA>
## 317 <NA> <NA> <NA>
## 318 <NA> <NA> <NA>
## 319 <NA> <NA> <NA>
## 320 <NA> <NA> <NA>
## 321 <NA> <NA> <NA>
## 322 <NA> <NA> <NA>
## 323 <NA> <NA> <NA>
## 324 <NA> <NA> <NA>
## 325 <NA> <NA> <NA>
## 326 <NA> <NA> <NA>
## 327 <NA> <NA> <NA>
## 328 <NA> <NA> <NA>
## 329 <NA> <NA> <NA>
## 330 <NA> <NA> <NA>
## 331 <NA> <NA> <NA>
## 332 <NA> <NA> <NA>
## 333 <NA> <NA> <NA>
## 334 <NA> <NA> <NA>
## 335 <NA> <NA> <NA>
## 336 <NA> <NA> <NA>
## 337 <NA> <NA> <NA>
## 338 <NA> <NA> <NA>
## 339 <NA> <NA> <NA>
## 340 <NA> <NA> <NA>
## 341 <NA> <NA> <NA>
## 342 <NA> <NA> <NA>
## 343 <NA> <NA> <NA>
## 344 <NA> <NA> <NA>
## 345 <NA> <NA> <NA>
## 346 <NA> <NA> <NA>
## 347 <NA> <NA> <NA>
## 348 <NA> <NA> <NA>
## 349 <NA> <NA> <NA>
## 350 <NA> <NA> <NA>
## 351 <NA> <NA> <NA>
## 352 <NA> <NA> <NA>
## 353 <NA> <NA> <NA>
## 354 <NA> <NA> <NA>
## 355 <NA> <NA> <NA>
## 356 <NA> <NA> <NA>
## 357 <NA> <NA> <NA>
## 358 <NA> <NA> <NA>
## 359 <NA> <NA> <NA>
## 360 <NA> <NA> <NA>
## 361 <NA> <NA> <NA>
## 362 <NA> <NA> <NA>
## 363 <NA> <NA> <NA>
## 364 <NA> <NA> <NA>
## 365 <NA> <NA> <NA>
## 366 <NA> <NA> <NA>
## 367 <NA> <NA> <NA>
## 368 <NA> <NA> <NA>
## 369 <NA> <NA> <NA>
## 370 <NA> <NA> <NA>
## 371 <NA> <NA> <NA>
## 372 <NA> <NA> <NA>
## 373 <NA> <NA> <NA>
## 374 <NA> <NA> <NA>
## 375 <NA> <NA> <NA>
## 376 <NA> <NA> <NA>
## 377 <NA> <NA> <NA>
## 378 <NA> <NA> <NA>
## 379 <NA> <NA> <NA>
## 380 <NA> <NA> <NA>
## 381 <NA> <NA> <NA>
## 382 <NA> <NA> <NA>
## 383 <NA> <NA> <NA>
## 384 <NA> <NA> <NA>
## 385 <NA> <NA> <NA>
## 386 <NA> <NA> <NA>
## 387 <NA> <NA> <NA>
## 388 <NA> <NA> <NA>
## 389 <NA> <NA> <NA>
## 390 <NA> <NA> <NA>
## 391 <NA> <NA> <NA>
## 392 <NA> <NA> <NA>
## 393 <NA> <NA> <NA>
## 394 <NA> <NA> <NA>
## 395 <NA> <NA> <NA>
## 396 <NA> <NA> <NA>
## 397 <NA> <NA> <NA>
## 398 <NA> <NA> <NA>
## 399 <NA> <NA> <NA>
## 400 <NA> <NA> <NA>
## 401 <NA> <NA> <NA>
## 402 <NA> <NA> <NA>
## 403 <NA> <NA> <NA>
## 404 <NA> <NA> <NA>
## 405 <NA> <NA> <NA>
## 406 <NA> <NA> <NA>
## 407 <NA> <NA> <NA>
## 408 <NA> <NA> <NA>
## 409 <NA> <NA> <NA>
## 410 <NA> <NA> <NA>
## 411 <NA> <NA> <NA>
## 412 <NA> <NA> <NA>
## 413 <NA> <NA> <NA>
## 414 <NA> <NA> <NA>
## 415 <NA> <NA> <NA>
## 416 <NA> <NA> <NA>
## 417 <NA> <NA> <NA>
## 418 <NA> <NA> <NA>
## 419 <NA> <NA> <NA>
## 420 <NA> <NA> <NA>
## 421 <NA> <NA> <NA>
## 422 <NA> <NA> <NA>
## 423 <NA> <NA> <NA>
## 424 <NA> <NA> <NA>
## 425 <NA> <NA> <NA>
## 426 <NA> <NA> <NA>
## 427 <NA> <NA> <NA>
## 428 <NA> <NA> <NA>
## 429 <NA> <NA> <NA>
## 430 <NA> <NA> <NA>
## 431 <NA> <NA> <NA>
## 432 <NA> <NA> <NA>
## 433 <NA> <NA> <NA>
Lovely linking! Now that your two datasets are merged, you can use the data to find out whether certain characteristics make a restaurant more likely to be reviewed by Zagat or Fodor’s.
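
As a rough sketch of that kind of follow-up, the snippet below works on a tiny made-up data frame rather than the real linked data. It assumes the `.x` columns come from one guide and the `.y` columns from the other, and that `NA` on one side means that guide has no record of the restaurant. The names `linked_toy`, `name.y`, `city.y`, `in_both`, and `coverage` are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical miniature of a linked data frame (values are made up).
# .x columns: one guide; .y columns: the other guide; NA = no record there.
linked_toy <- data.frame(
  name.x = c("aqua", "boulevard", NA, "citrus"),
  city.x = c("san francisco", "san francisco", NA, "los angeles"),
  name.y = c("aqua", NA, "spago", "citrus"),
  city.y = c("san francisco", NA, "los angeles", "los angeles"),
  stringsAsFactors = FALSE
)

coverage <- linked_toy %>%
  mutate(
    # Take the city from whichever side has it
    city = coalesce(city.x, city.y),
    # TRUE when the restaurant appears in both guides
    in_both = !is.na(name.x) & !is.na(name.y)
  ) %>%
  group_by(city) %>%
  summarise(
    n_restaurants = n(),
    share_in_both = mean(in_both)
  )

coverage
```

The same pattern could be applied to other columns, such as cuisine type, to compare how often each group appears in one guide, the other, or both.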